Fall 2018, DSPA (HS650)
SID: 3911 UMich E-mail: zerualem@umich.edu
I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.
Natural environmental systems are very complex, and lake ecosystems are one such example. Because of their invaluable importance, understanding these systems with the help of data-driven machine learning algorithms is important. A large volume of data is now available for processing and analysis, and we can apply data science tools to this data to better understand lakes and their processes. I applied different predictive models to assess the predictability of chlorophyll concentration in lakes. The models were trained and tested with good results, which can be used with some caution and further improved upon. In addition, I analyzed the data with different clustering algorithms to understand the types of lakes in the United States. The results indicate that there are at least two distinct clusters of lakes, and that clustering algorithms can also be used to classify lakes and identify their trophic states.
Data science is an exciting and very powerful field of study that allows us to mine the plethora of available data to understand phenomena that cannot be explored with traditional methods alone. This is also true for environmental data: both natural and engineered systems are very complex and are often handled with process-based mathematical and numerical models. Although these tools are powerful and have been used successfully, data science approaches can greatly reduce the effort required to develop them and can complement them in the areas where they fall short. To demonstrate this idea, I have chosen to apply different data science and predictive analytics methods to environmental data. Specifically, I will be working on water quality data for lakes in the continental US.
The U.S. Environmental Protection Agency (USEPA) has implemented the National Lakes Assessment (NLA) project in partnership with states and tribes. The NLA is a national-scale sampling program designed to understand current conditions, track environmental improvements, and inform decisions at a larger scale 1 2. In this effort, the USEPA conducts field sampling of over 1000 lakes across the continental US. So far, sampling has been carried out in the years 2007 and 2012. The sampling included water bodies (lakes, reservoirs, and ponds) within the 48 contiguous United States with an area greater than 1 ha and a depth greater than 1 m. The resulting data comprise biological, chemical, and human land use features 3. The data are publicly available on the USEPA's website at the following address [https://www.epa.gov/national-aquatic-resource-surveys/nla]
Lakes are inland water bodies that are one of the main sources of freshwater for human beings. Lakes have a variety of social and economic uses, such as water supply, fishing, water transport, recreation, and tourism. A lake ecosystem is a very complex system comprising physical, chemical, and biological components. This complex system is highly dynamic and also vulnerable over time under both natural and human influences. For these reasons, protecting our lakes is critical for their continued social, environmental, and economic benefits.
However, the ever-increasing population and urbanization pose a serious problem for our lake systems, drying up lakes and impairing them with dangerous pollutants. One of the major problems in lake systems is eutrophication and the accompanying loss of biodiversity 4. Even though lakes naturally evolve and change their characteristics, this natural process can be accelerated or thrown out of balance by human influence. This occurs mainly when untreated wastewater is discharged directly into lakes or arrives as stormwater runoff from urban and agricultural areas. The wastewater changes the chemistry of the lakes, often causing nutrient enrichment, a phenomenon called lake eutrophication 5 6. This can accelerate biological growth such as algal blooms. Therefore, monitoring and evaluating the condition of our lakes is critical for their sustainability.
One field of study that deals with the science of lakes and freshwater ecosystem dynamics is limnology. Key water quality parameters studied in limnology include phosphorus, nitrogen, transparency (Secchi depth), chlorophyll a, dissolved oxygen, and temperature 7. Of these, chlorophyll a, a green pigment, is directly linked with the growth of green algae, allowing them to photosynthesize 8. Hence, chlorophyll a is measured in lakes to determine how much algae the lake contains. Excessive growth of algae, such as harmful algal blooms (HABs), is becoming a global epidemic. One such recent incident is the case of Lake Erie during the summer of 2014, which caused the shutdown of the water supply system in Toledo, Ohio 9. Therefore, predicting chlorophyll a concentration is critical to avoid such impactful occurrences of HABs and to help inform policy decisions.
Conventionally, complex dynamic three-dimensional empirical and process-based numerical models are used to understand the processes in lakes, owing to their complexity and heterogeneity 10. These simulation models require large and diverse data sets in addition to computational power, and each lake is often treated separately because each lake is unique. However, there is a large volume of data available that has not been analyzed using current data science tools and methods. Therefore, the aim of this project is to demonstrate the use of current data science techniques for lake water quality modeling and assessment. To this end, I have processed and analyzed lake water quality data, focusing on model development for chlorophyll a concentration and unsupervised classification of US lakes.
In general, I want to demonstrate the power of data science tools and techniques to process, visualize, and perform in-depth analysis of data to gain insights and an understanding of the system. Specifically, the main questions I would like to answer in this project include the following:
The EPA provides the lakes water quality data for free on its website at [https://www.epa.gov/national-aquatic-resource-surveys/nla]. I downloaded the csv files from this website and stored them on my local hard disk. In this section I import several of these datasets, which will be used for the analysis in this project.
library(dplyr)
library(reshape2)
# Read landused metrics
lakes_lu07 <- read.csv("data/Lakes/final/nla2007_basin_landuse_metrics.csv", header = T, stringsAsFactors = F)
dim(lakes_lu07)
## [1] 1157 59
# Read lakes dissolved oxygen data
lakes_do07 <- read.csv("data/Lakes/final/nla2007_meando_cond.csv", header = T, stringsAsFactors = F)
dim(lakes_do07)
## [1] 1252 17
# Read lakes Physical habitat condition data
lakes_phy07 <- read.csv("data/Lakes/final/nla2007_phab_indexvalues.csv", header = T, stringsAsFactors = F)
dim(lakes_phy07)
## [1] 1442 45
# Read sampled lakes information
lakes_info07 <- read.csv("data/Lakes/final/nla2007_sampledlakeinfo.csv", header = T, stringsAsFactors = F)
dim(lakes_info07)
## [1] 1252 47
# Read lakes secchi depth data
lakes_secchi07 <- read.csv("data/Lakes/final/nla2007_secchi.csv", header = T, stringsAsFactors = F)
dim(lakes_secchi07)
## [1] 1252 3
# Read lakes trophic level estimate data
lakes_troph07 <- read.csv("data/Lakes/final/nla2007_trophic_conditionestimate.csv", header = T, stringsAsFactors = F)
dim(lakes_troph07)
## [1] 1252 12
# Read lakes water chemistry data
lakes_chem07 <- read.csv("data/Lakes/final/EPA_NLA_CHEM07.csv", header = T, stringsAsFactors = F)
dim(lakes_chem07)
## [1] 1326 28
Next, let us join the data frames into one data frame by their site ID using the full_join function from the dplyr package.
lakes.07 <- full_join(lakes_chem07, lakes_do07) %>% full_join(lakes_info07) %>%
full_join(lakes_phy07) %>% full_join(lakes_secchi07) %>% full_join(lakes_troph07)
dim(lakes.07)
## [1] 3652 120
length(unique(lakes.07$SITE_ID))
## [1] 1157
#write.csv(lakes.07, "lakes.07.csv")
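As a quick illustration of why a full outer join can inflate the row count, here is a toy sketch with two made-up tables; base R's merge with all = TRUE behaves like dplyr::full_join and keeps unmatched rows from both sides.

```r
# Toy example: a full outer join keeps unmatched rows from both tables
a <- data.frame(SITE_ID = c("S1", "S2"), PTL = c(10, 20))
b <- data.frame(SITE_ID = c("S2", "S3"), CHLA = c(5, 7))
ab <- merge(a, b, by = "SITE_ID", all = TRUE)  # base-R analogue of dplyr::full_join
nrow(ab)  # 3 rows: S1 (no CHLA), S2 (matched), S3 (no PTL)
```

Rows that exist in only one table get NA in the other table's columns, which is why the joined NLA data needs the deduplication step that follows.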
names(lakes.07)
## [1] "SITE_ID" "DATE_COL" "DATECHEM"
## [4] "PH_FIELD" "PH_LAB" "COND"
## [7] "ANC" "TURB" "TOC"
## [10] "DOC" "NH4N_PPM" "NO3_NO2"
## [13] "NTL_PPM" "PTL" "CL_PPM"
## [16] "NO3N_PPM" "SO4_PPM" "CA_PPM"
## [19] "MG_PPM" "NA_PPM" "K_PPM"
## [22] "COLOR" "SIO2" "NH4ION"
## [25] "CATSUM" "ANSUM2" "CHLA"
## [28] "SECMEAN" "VISIT_NO" "LAT_DD"
## [31] "LON_DD" "ST" "EPA_REG"
## [34] "URBAN" "NUT_REG" "NUTREG_NAME"
## [37] "ECO_NUTA" "LAKE_ORIGIN" "ECO3_X_ORIGIN"
## [40] "REF_CLUSTER" "RT_NLA" "HUC_2"
## [43] "HUC_8" "DO2_2M" "SAMPLED"
## [46] "REPEAT" "ALBERS_X" "ALBERS_Y"
## [49] "FLD_LON_DD" "FLD_LAT_DD" "STATE_NAME"
## [52] "CNTYNAME" "NHDNAME" "LAKENAME"
## [55] "MDCATY" "WGT" "WGT_NLA"
## [58] "ADJWGT_CAT" "WSA_ECO3" "WSA_ECO9"
## [61] "ECO_LEV_3" "ECO_L3_NAM" "REFCLUS_NAME"
## [64] "REF_NUTR" "AREA_HA" "SIZE_CLASS"
## [67] "LAKEAREA" "LAKEPERIM" "SLD"
## [70] "DEPTH_X" "DEPTHMAX" "ELEV_PT"
## [73] "COM_ID" "REACHCODE" "DATEPHAB"
## [76] "CLASSP" "EcoP6" "ssiBedBld"
## [79] "ssiNATBedBld" "rviLowWood" "RVegQ_7"
## [82] "RVegQ_8" "L_RVegQ_8" "LRCVQ_7A"
## [85] "LRCVQ_7B" "LRCVQ_7C" "LRCVQ_7D"
## [88] "LRCVQ_8D" "L_LRCVQ_8D" "ElevXLat"
## [91] "ElevDLat" "ElevXLon" "XER"
## [94] "XER_X_Elev" "WMT" "RDisIXAgAdj5"
## [97] "RDisIXAgAdj4" "RDisIXAgAdj3" "RDisIXAgAdj2"
## [100] "RDis_InEx" "RDis_IX" "RvegQ_Var"
## [103] "RVegQ" "LogRVegQ" "LitCvrQ_Var"
## [106] "LitCvrQ" "LogLitCvrQ" "LitRipCVQ_Var"
## [109] "LitRipCVQ" "LogLitRipCvQ" "RVeg_OE"
## [112] "LitCvr_OE" "LitRipCvr_OE" "DATE_SECCHI"
## [115] "NTL" "CLEAR_TO_BOTTOM" "TSTATE_TP"
## [118] "TSTATE_TN" "TSTATE_CHL" "TSTATE_SECCHI"
This data has 3652 observations and 120 features, but only 1157 unique site IDs. This implies that some of the lakes were sampled once while others were sampled more than once in 2007. Thus, to be consistent and to simplify the analysis, we can take the average for those lakes with more than one sampling.
# Aggregate only the numeric features by site id
lakes.07.agg <- lakes.07 %>% group_by(SITE_ID) %>% summarise_if(is.numeric, mean, na.rm=T) %>% ungroup()
# Aggregate the character features
lakes.07.char <- lakes.07 %>% group_by(SITE_ID) %>% summarise_if(is.character, max)
# Check if the two dataframes match
length(intersect(lakes.07.agg$SITE_ID, lakes.07.char$SITE_ID))
## [1] 1157
# combine the two data frames
lakes.07.df <- data.frame(left_join(lakes.07.agg, lakes.07.char))
## Joining, by = "SITE_ID"
dim(lakes.07.df)
## [1] 1157 120
length(unique(lakes.07.df$SITE_ID))
## [1] 1157
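The repeat-visit averaging above can be sketched on a tiny made-up table; base R's aggregate gives the same collapse as the group_by/summarise_if pipeline.

```r
# Toy version of averaging repeat samplings: two visits to S1 collapse to one row
visits <- data.frame(SITE_ID = c("S1", "S1", "S2"), CHLA = c(10, 20, 5))
visits_avg <- aggregate(CHLA ~ SITE_ID, FUN = mean, data = visits)
visits_avg  # S1 gets the mean CHLA of 15, S2 keeps its single value
```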
In this section we will load the data for 2012 and, similar to the 2007 data, aggregate the different datasets into one data frame.
library(dplyr)
# Read lakes water chemistry data
lakes_chem12 <- read.csv("data/Lakes/final/nla2012_waterchem_wide.csv",
header = T, stringsAsFactors = F)
dim(lakes_chem12)
## [1] 1230 21
# Read lakes water chemistry profile data
lakes_prof12 <- read.csv("data/Lakes/final/nla2012_wide_profile.csv",
header = T, stringsAsFactors = F)
dim(lakes_prof12)
## [1] 30717 56
# Aggregate profile data using the average
lakes_prof_agg <- aggregate(cbind(OXYGEN, PH, TEMPERATURE, DEPTH, CONDUCTIVITY) ~ UID, FUN = mean, data=lakes_prof12)
lakes_prof_agg2 <- aggregate(cbind(INDEX_LAT_DD, INDEX_LON_DD) ~ UID, FUN = mean, data=lakes_prof12)
# Lakes site information
lakes_info12 <- read.csv("data/Lakes/final/nla2012_wide_siteinfo.csv",
header = T, stringsAsFactors = F)
# Read lakes secchi depth
lakes_secch12 <- read.csv("data/Lakes/final/nla2012_secchi.csv",
header = T, stringsAsFactors = F)
dim(lakes_secch12)
## [1] 1221 13
# Read lakes chlorophyll data
lakes_chla12 <- read.csv("data/Lakes/final/nla2012_chla_wide.csv",
header = T, stringsAsFactors = F)
dim(lakes_chla12)
## [1] 1230 7
# Read key variables data
lakes_key12 <- read.csv("data/Lakes/final/nla12_keyvariables_data.csv",
header = T, stringsAsFactors = F)
dim(lakes_key12)
## [1] 1138 50
lakes_key12small <- select(lakes_key12, UID,SITE_ID, MMI_BENT_NLA12, MMI_ZOOP_NLA6, AMITOTAL, LitCvrQc3OE, LitRipCvrQc3OE, RDis_IX, RVegQc3OE, MICX_RESULT, ATRAZINE_RESULT, TOTALHG_RESULT, METHYLMERCURY_RESULT)
# Read lakes condition data
lakes_cond12 <- read.csv("data/Lakes/final/nla_2012_condition_categories.csv",
header = T, stringsAsFactors = F)
dim(lakes_cond12)
## [1] 1038 51
# Read lakes phytoplankton count data
lakes_phyto12 <- read.csv("data/Lakes/final/nla2012_wide_phytoplankton_count.csv",
header = T, stringsAsFactors = F)
dim(lakes_phyto12)
## [1] 38627 29
# Read lakes physical habitat condition data
lakes_phab12 <- read.csv("data/Lakes/final/nla2012_wide_phab.csv",
header = T, stringsAsFactors = F)
dim(lakes_phab12)
## [1] 33187 115
lakes_basinarea <- select(lakes_lu07, SITE_ID, BASINAREA_HA)
names(lakes_basinarea)[1] <- "SITEID_07"
# Merge all the data frames into one
lakes_data <- left_join(lakes_chem12,lakes_chla12, by="UID")
lakes_data <- left_join(lakes_data, lakes_cond12, by="UID")
lakes_data <- left_join(lakes_data, lakes_info12, by="UID")
lakes_data <- left_join(lakes_data, lakes_prof_agg, by="UID")
lakes_data <- left_join(lakes_data, lakes_prof_agg2, by="UID")
lakes_data <- left_join(lakes_data, lakes_secch12, by="UID")
lakes_data <- left_join(lakes_data, lakes_key12, by="UID")
#lakes_data <- left_join(lakes_data, lakes_basinarea, by="SITEID_07")
lakes_phyto <- aggregate(ABUNDANCE ~ UID, FUN = sum, data=lakes_phyto12)
lakes_data <- left_join(lakes_data, lakes_phyto)
## Joining, by = "UID"
# Remove repeated columns
lakes_data <- select(lakes_data, -ends_with(".y"))
dim(lakes_data)
## [1] 1230 170
# Clean column names (escape the dot so only the literal ".x" join suffix is removed)
names(lakes_data) <- gsub("\\.x|_RESULT", "", names(lakes_data))
# Select key features
select_lakes <- subset(lakes_data, select=c(UID:TURB, CHLX, CHLL, OXYGEN, PH, TEMPERATURE, DEPTH, CONDUCTIVITY,SECCHI, AREA_HA, ELEVATION,ABUNDANCE,PERIM_KM, INDEX_LAT_DD, INDEX_LON_DD, MMI_BENT_NLA12, MMI_ZOOP_NLA6, AMITOTAL, LitCvrQc3OE, LitRipCvrQc3OE, RDis_IX, RVegQc3OE, MICX, ATRAZINE, METHYLMERCURY, TOTALHG, URBAN, LAKE_ORIGIN))
select_lakes$URBAN <- as.factor(select_lakes$URBAN)
select_lakes$LAKE_ORIGIN <- as.factor(select_lakes$LAKE_ORIGIN)
dim(select_lakes)
## [1] 1230 48
First, I will define a custom function that calculates basic summaries of the features and returns them as a data frame. This function will then be used to compute summary statistics on the data.
# Summary function
# Calculate summary statistics
summaries <- function(x){
x_summary <- matrix(0, ncol(x),5)
for (i in 1:ncol(x)) {
x_summary[i,] <- c(min(x[,i], na.rm=T),
max(x[,i], na.rm = T),
sd(x[,i], na.rm = T),
mean(x[,i], na.rm=T),
sum(is.na(x[,i])) )
}
x_summary <- data.frame(round(x_summary,3), row.names = names(x))
names(x_summary) <- c("min", "max", "sd", "mean", "NAs")
return(x_summary)
}
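As a quick sanity check, the function can be exercised on a small made-up data frame (the helper is restated here so the snippet runs on its own):

```r
# Restated helper (identical to the definition above) plus a toy check
summaries <- function(x){
  x_summary <- matrix(0, ncol(x), 5)
  for (i in 1:ncol(x)) {
    x_summary[i,] <- c(min(x[,i], na.rm = T),
                       max(x[,i], na.rm = T),
                       sd(x[,i], na.rm = T),
                       mean(x[,i], na.rm = T),
                       sum(is.na(x[,i])))
  }
  x_summary <- data.frame(round(x_summary, 3), row.names = names(x))
  names(x_summary) <- c("min", "max", "sd", "mean", "NAs")
  return(x_summary)
}
toy <- data.frame(a = c(1, 2, NA, 4), b = c(10, 20, 30, 40))
summaries(toy)  # one row per feature; a has 1 NA, b has none
```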
We can use the function defined above to compute summary statistics on the aggregated data for the year 2012.
library(kableExtra)
# Drop TSS, which is mostly missing, and the repeated variable PH.1
select_lakes <- select(select_lakes, -TSS, -PH.1)
lakes_summ <- summaries(select_lakes[,2:44])
lakes_summ %>% kable(digits = 3)%>% kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12)
|  | min | max | sd | mean | NAs |
|---|---|---|---|---|---|
| ALUMINUM | 0.000 | 3.793 | 0.145 | 0.023 | 168 |
| AMMONIA_N | 0.000 | 3.180 | 0.132 | 0.037 | 1 |
| ANC | -3361.400 | 203857.000 | 10133.962 | 2696.844 | 0 |
| CALCIUM | 0.121 | 594.900 | 45.795 | 28.325 | 0 |
| CHLORIDE | 0.036 | 18012.742 | 555.925 | 53.542 | 0 |
| COLOR | 0.000 | 840.000 | 39.935 | 25.315 | 1 |
| COND | 2.820 | 64810.000 | 3162.987 | 688.356 | 0 |
| DOC | 0.230 | 515.810 | 18.544 | 8.287 | 0 |
| MAGNESIUM | 0.067 | 1023.000 | 68.048 | 24.561 | 0 |
| NITRATE_N | 0.000 | 51.660 | 1.555 | 0.108 | 52 |
| NITRITE_N | 0.000 | 0.419 | 0.013 | 0.001 | 168 |
| NTL | 0.014 | 54.000 | 2.493 | 1.123 | 0 |
| PH | 2.830 | 10.470 | 0.845 | 8.072 | 0 |
| POTASSIUM | 0.041 | 3376.000 | 121.946 | 11.799 | 2 |
| PTL | 4.000 | 3636.000 | 269.738 | 114.750 | 0 |
| SILICA | 0.022 | 935.000 | 29.725 | 10.882 | 0 |
| SODIUM | 0.093 | 29890.000 | 1093.407 | 109.473 | 1 |
| SULFATE | 0.026 | 47325.202 | 1681.334 | 194.297 | 0 |
| TURB | 0.010 | 447.150 | 27.294 | 9.451 | 41 |
| CHLX | 0.000 | 764.640 | 54.082 | 26.118 | 5 |
| CHLL | 0.000 | 584.640 | 58.070 | 27.332 | 9 |
| OXYGEN | 0.250 | 23.686 | 2.447 | 6.751 | 170 |
| TEMPERATURE | 6.659 | 335.001 | 11.639 | 21.765 | 170 |
| DEPTH | 0.000 | 25.000 | 3.640 | 3.264 | 170 |
| CONDUCTIVITY | 1.780 | 68863.636 | 3828.352 | 773.638 | 170 |
| SECCHI | 0.020 | 28.000 | 2.311 | 2.134 | 75 |
| AREA_HA | 1.033 | 167489.609 | 7501.797 | 861.475 | 100 |
| ELEVATION | -53.270 | 3594.970 | 758.063 | 646.201 | 100 |
| ABUNDANCE | 267.000 | 67860.000 | 5494.833 | 3685.472 | 0 |
| PERIM_KM | 0.394 | 3520.774 | 125.610 | 21.095 | 100 |
| INDEX_LAT_DD | 26.070 | 48.985 | 4.950 | 40.760 | 0 |
| INDEX_LON_DD | -124.230 | -67.194 | 14.842 | -95.078 | 0 |
| MMI_BENT_NLA12 | 0.000 | 86.100 | 15.820 | 43.188 | 222 |
| MMI_ZOOP_NLA6 | 0.050 | 94.650 | 17.583 | 55.213 | 96 |
| AMITOTAL | 0.000 | 2.045 | 0.353 | 0.360 | 117 |
| LitCvrQc3OE | 0.000 | 5.636 | 0.717 | 0.881 | 117 |
| LitRipCvrQc3OE | 0.000 | 4.696 | 0.469 | 0.735 | 119 |
| RDis_IX | 0.000 | 0.950 | 0.285 | 0.488 | 124 |
| RVegQc3OE | 0.000 | 4.532 | 0.508 | 0.701 | 119 |
| MICX | 0.000 | 66.690 | 3.535 | 0.669 | 92 |
| ATRAZINE | 0.000 | 9.700 | 0.486 | 0.127 | 94 |
| METHYLMERCURY | 0.080 | 19.000 | 1.568 | 1.218 | 222 |
| TOTALHG | 4.840 | 859.730 | 93.894 | 103.161 | 227 |
As can be seen from the results above, we have some missing values, and two of the features, ELEVATION and ANC (acid neutralizing capacity), have some negative values. In the next section we will try to impute the missing values using the mice and mi packages.
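A simple base-R pattern for flagging such negative values, shown here on made-up numbers (note that ANC can legitimately be negative, so this is an inspection step rather than a cleaning step):

```r
# Count negative values per numeric column (toy data for illustration)
df <- data.frame(ELEVATION = c(-53.3, 120, 5), ANC = c(-3361.4, 10, 200))
neg_counts <- sapply(df, function(col) sum(col < 0, na.rm = TRUE))
neg_counts  # ELEVATION: 1, ANC: 1
```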
Similarly, we can use the function to compute summary statistics on the aggregated data for the year 2007. The summary is done on the numeric features.
# Summary for the numeric features
lakes07_summ <- summaries(select_if(lakes.07.df, is.numeric))
lakes07_summ %>% kable(digits = 3)%>% kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12)
|  | min | max | sd | mean | NAs |
|---|---|---|---|---|---|
| PH_FIELD | 4.100 | 1.030000e+01 | 8.990000e-01 | 8.080000e+00 | 9 |
| PH_LAB | 4.160 | 1.009000e+01 | 7.650000e-01 | 8.014000e+00 | 0 |
| COND | 4.350 | 5.059000e+04 | 2.470393e+03 | 6.652670e+02 | 0 |
| ANC | -62.960 | 9.163290e+04 | 5.830707e+03 | 2.620119e+03 | 0 |
| TURB | 0.147 | 5.740000e+02 | 3.413200e+01 | 1.291600e+01 | 0 |
| TOC | 0.370 | 3.250500e+02 | 2.002100e+01 | 1.001200e+01 | 0 |
| DOC | 0.340 | 2.905700e+02 | 1.690600e+01 | 8.878000e+00 | 0 |
| NH4N_PPM | 0.004 | 1.547000e+00 | 1.100000e-01 | 4.200000e-02 | 0 |
| NO3_NO2 | 0.002 | 5.592000e+00 | 3.330000e-01 | 6.900000e-02 | 0 |
| NTL_PPM | 0.010 | 2.610000e+01 | 2.154000e+00 | 1.167000e+00 | 0 |
| PTL | 0.714 | 4.753400e+03 | 2.698330e+02 | 1.073340e+02 | 0 |
| CL_PPM | 0.050 | 1.581431e+04 | 5.251390e+02 | 5.672500e+01 | 0 |
| NO3N_PPM | 0.002 | 5.409000e+00 | 3.080000e-01 | 7.300000e-02 | 0 |
| SO4_PPM | 0.083 | 4.008575e+04 | 1.528241e+03 | 1.887880e+02 | 0 |
| CA_PPM | 0.205 | 4.857000e+02 | 3.698100e+01 | 2.782800e+01 | 0 |
| MG_PPM | 0.062 | 2.466000e+03 | 1.165320e+02 | 2.706700e+01 | 0 |
| NA_PPM | 0.193 | 1.674000e+04 | 6.380970e+02 | 8.697200e+01 | 0 |
| K_PPM | 0.050 | 1.412000e+03 | 5.395700e+01 | 9.862000e+00 | 0 |
| COLOR | 0.000 | 1.650000e+02 | 1.606700e+01 | 1.618900e+01 | 0 |
| SIO2 | 0.025 | 9.190700e+01 | 1.061900e+01 | 8.783000e+00 | 0 |
| NH4ION | 0.000 | 9.496400e+01 | 6.522000e+00 | 2.561000e+00 | 0 |
| CATSUM | 28.670 | 9.415390e+05 | 3.757639e+04 | 7.653919e+03 | 0 |
| ANSUM2 | 32.840 | 9.632158e+05 | 4.069073e+04 | 8.155782e+03 | 0 |
| CHLA | 0.068 | 8.712000e+02 | 6.470200e+01 | 2.852400e+01 | 5 |
| SECMEAN | 0.040 | 3.671000e+01 | 2.500000e+00 | 2.186000e+00 | 65 |
| VISIT_NO | 1.000 | 1.556000e+00 | 1.380000e-01 | 1.041000e+00 | 0 |
| LAT_DD | 26.936 | 4.897900e+01 | 5.036000e+00 | 4.062000e+01 | 0 |
| LON_DD | -124.633 | -6.769900e+01 | 1.422700e+01 | -9.465800e+01 | 0 |
| HUC_2 | 1.000 | 1.800000e+01 | 5.011000e+00 | 8.455000e+00 | 0 |
| HUC_8 | 1010002.000 | 1.810020e+07 | 5.016659e+06 | 8.525925e+06 | 0 |
| DO2_2M | 0.800 | 2.100000e+01 | 1.863000e+00 | 7.845000e+00 | 55 |
| ALBERS_X | -2292120.601 | 2.203317e+06 | 1.162261e+06 | 1.245475e+05 | 0 |
| ALBERS_Y | -1055938.590 | 1.524035e+06 | 5.888666e+05 | 4.369279e+05 | 0 |
| FLD_LON_DD | -124.621 | -6.769900e+01 | 1.423400e+01 | -9.465500e+01 | 1 |
| FLD_LAT_DD | 26.802 | 4.898300e+01 | 5.034000e+00 | 4.061600e+01 | 1 |
| MDCATY | 0.000 | 7.600000e-01 | 6.400000e-02 | 5.500000e-02 | 0 |
| WGT | 0.000 | 7.703160e+02 | 1.781560e+02 | 8.549100e+01 | 0 |
| WGT_NLA | 0.000 | 7.248870e+02 | 9.154900e+01 | 4.304700e+01 | 0 |
| ECO_LEV_3 | 1.000 | 8.400000e+01 | 2.032700e+01 | 4.415300e+01 | 0 |
| AREA_HA | 4.037 | 1.674896e+05 | 8.020313e+03 | 1.230252e+03 | 0 |
| LAKEAREA | 0.040 | 1.674896e+03 | 8.020300e+01 | 1.230300e+01 | 0 |
| LAKEPERIM | 0.780 | 2.763418e+03 | 1.381510e+02 | 3.299900e+01 | 0 |
| SLD | 1.019 | 2.863000e+01 | 2.588000e+00 | 2.644000e+00 | 0 |
| DEPTH_X | 0.500 | 9.700000e+01 | 1.021300e+01 | 9.449000e+00 | 17 |
| DEPTHMAX | 0.500 | 9.700000e+01 | 1.026900e+01 | 9.494000e+00 | 1 |
| ELEV_PT | 0.000 | 3.403000e+03 | 6.788460e+02 | 6.076520e+02 | 0 |
| COM_ID | 0.000 | 2.453966e+07 | 7.235670e+06 | 1.184537e+07 | 0 |
| REACHCODE | 2060003.000 | 1.810020e+13 | 5.047690e+12 | 8.338619e+12 | 1 |
| ssiBedBld | 0.000 | 9.640000e-01 | 1.310000e-01 | 9.200000e-02 | 57 |
| ssiNATBedBld | 0.000 | 9.640000e-01 | 1.080000e-01 | 4.400000e-02 | 57 |
| rviLowWood | 0.000 | 1.702000e+00 | 3.270000e-01 | 4.210000e-01 | 59 |
| RVegQ_7 | 0.000 | 5.520000e-01 | 1.070000e-01 | 1.550000e-01 | 59 |
| RVegQ_8 | 0.000 | 5.750000e-01 | 1.270000e-01 | 2.160000e-01 | 68 |
| L_RVegQ_8 | -2.000 | -2.330000e-01 | 4.110000e-01 | -7.750000e-01 | 68 |
| LRCVQ_7A | 0.000 | 9.500000e-01 | 1.700000e-01 | 3.120000e-01 | 60 |
| LRCVQ_7B | 0.000 | 7.060000e-01 | 1.150000e-01 | 2.090000e-01 | 62 |
| LRCVQ_7C | 0.000 | 5.840000e-01 | 1.030000e-01 | 1.830000e-01 | 62 |
| LRCVQ_7D | 0.000 | 3.980000e-01 | 7.800000e-02 | 1.250000e-01 | 62 |
| LRCVQ_8D | 0.000 | 4.450000e-01 | 8.700000e-02 | 1.560000e-01 | 72 |
| L_LRCVQ_8D | -2.000 | -3.420000e-01 | 3.150000e-01 | -8.660000e-01 | 72 |
| ElevXLat | 0.000 | 1.412006e+05 | 2.799303e+04 | 2.570500e+04 | 57 |
| ElevDLat | 0.000 | 8.571300e+01 | 1.707300e+01 | 1.509200e+01 | 57 |
| ElevXLon | -375708.756 | 0.000000e+00 | 7.751939e+04 | -6.445280e+04 | 57 |
| XER | 0.000 | 1.000000e+00 | 2.660000e-01 | 7.600000e-02 | 56 |
| XER_X_Elev | 0.000 | 2.968800e+03 | 3.982880e+02 | 9.950400e+01 | 56 |
| WMT | 0.000 | 1.000000e+00 | 3.530000e-01 | 1.450000e-01 | 56 |
| RDisIXAgAdj5 | 0.000 | 9.450000e-01 | 2.760000e-01 | 4.740000e-01 | 57 |
| RDisIXAgAdj4 | 0.000 | 9.410000e-01 | 2.740000e-01 | 4.690000e-01 | 57 |
| RDisIXAgAdj3 | 0.000 | 9.360000e-01 | 2.720000e-01 | 4.640000e-01 | 57 |
| RDisIXAgAdj2 | 0.000 | 9.320000e-01 | 2.700000e-01 | 4.560000e-01 | 57 |
| RDis_InEx | 0.000 | 9.320000e-01 | 2.670000e-01 | 4.440000e-01 | 57 |
| RDis_IX | 0.000 | 9.450000e-01 | 2.760000e-01 | 4.740000e-01 | 57 |
| RVegQ | 0.000 | 5.580000e-01 | 1.130000e-01 | 1.710000e-01 | 60 |
| LogRVegQ | -2.000 | -2.460000e-01 | 3.810000e-01 | -8.660000e-01 | 60 |
| LitCvrQ | 0.000 | 1.105000e+00 | 1.120000e-01 | 1.220000e-01 | 60 |
| LogLitCvrQ | -2.000 | 4.700000e-02 | 3.780000e-01 | -1.029000e+00 | 60 |
| LitRipCVQ | 0.000 | 5.880000e-01 | 9.400000e-02 | 1.470000e-01 | 63 |
| LogLitRipCvQ | -2.000 | -2.230000e-01 | 3.200000e-01 | -9.020000e-01 | 63 |
| RVeg_OE | 0.000 | 3.142000e+00 | 4.760000e-01 | 7.590000e-01 | 60 |
| LitCvr_OE | 0.000 | 5.099000e+00 | 6.760000e-01 | 8.590000e-01 | 60 |
| LitRipCvr_OE | 0.000 | 3.041000e+00 | 4.380000e-01 | 7.620000e-01 | 63 |
| NTL | 5.000 | 2.610000e+04 | 2.152594e+03 | 1.166017e+03 | 0 |
summary(lakes.07.df$URBAN)
## Length Class Mode
## 1157 character character
Compared to the 2012 data, the 2007 data has fewer missing values. In the next section, I will use the mice and mi packages to impute the missing data.
library("betareg")
library("mi")
# Show the missing-data information matrix
mdf <- missing_data.frame(select_lakes[,-1])
## NOTE: The following pairs of variables appear to have the same missingness pattern.
## Please verify whether they are in fact logically distinct variables.
## [,1] [,2]
## [1,] "ALUMINUM" "NITRITE_N"
## [2,] "AREA_HA" "ELEVATION"
## [3,] "AREA_HA" "PERIM_KM"
## [4,] "AREA_HA" "URBAN"
## [5,] "ELEVATION" "PERIM_KM"
## [6,] "ELEVATION" "URBAN"
## [7,] "PERIM_KM" "URBAN"
## [8,] "LitRipCvrQc3OE" "RVegQc3OE"
show(mdf)
## Object of class missing_data.frame with 1230 observations on 45 variables
##
## There are 83 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
## type missing method model
## ALUMINUM continuous 168 ppd linear
## AMMONIA_N continuous 1 ppd linear
## ANC continuous 0 <NA> <NA>
## CALCIUM continuous 0 <NA> <NA>
## CHLORIDE continuous 0 <NA> <NA>
## COLOR continuous 1 ppd linear
## COND continuous 0 <NA> <NA>
## DOC continuous 0 <NA> <NA>
## MAGNESIUM continuous 0 <NA> <NA>
## NITRATE_N continuous 52 ppd linear
## NITRITE_N SC_proportion 168 ppd betareg
## NITRITE_N:is_zero binary 168 ppd logit
## NTL continuous 0 <NA> <NA>
## PH continuous 0 <NA> <NA>
## POTASSIUM continuous 2 ppd linear
## PTL continuous 0 <NA> <NA>
## SILICA continuous 0 <NA> <NA>
## SODIUM continuous 1 ppd linear
## SULFATE continuous 0 <NA> <NA>
## TURB continuous 41 ppd linear
## CHLX continuous 5 ppd linear
## CHLL continuous 9 ppd linear
## OXYGEN continuous 170 ppd linear
## TEMPERATURE continuous 170 ppd linear
## DEPTH continuous 170 ppd linear
## CONDUCTIVITY continuous 170 ppd linear
## SECCHI continuous 75 ppd linear
## AREA_HA continuous 100 ppd linear
## ELEVATION continuous 100 ppd linear
## ABUNDANCE continuous 0 <NA> <NA>
## PERIM_KM continuous 100 ppd linear
## INDEX_LAT_DD continuous 0 <NA> <NA>
## INDEX_LON_DD continuous 0 <NA> <NA>
## MMI_BENT_NLA12 continuous 222 ppd linear
## MMI_ZOOP_NLA6 continuous 96 ppd linear
## AMITOTAL continuous 117 ppd linear
## LitCvrQc3OE continuous 117 ppd linear
## LitRipCvrQc3OE continuous 119 ppd linear
## RDis_IX SC_proportion 124 ppd betareg
## RDis_IX:is_zero binary 124 ppd logit
## RVegQc3OE continuous 119 ppd linear
## MICX continuous 92 ppd linear
## ATRAZINE continuous 94 ppd linear
## METHYLMERCURY continuous 222 ppd linear
## TOTALHG continuous 227 ppd linear
## URBAN binary 100 ppd logit
## LAKE_ORIGIN binary 192 ppd logit
##
## family link transformation
## ALUMINUM gaussian identity standardize
## AMMONIA_N gaussian identity standardize
## ANC <NA> <NA> standardize
## CALCIUM <NA> <NA> standardize
## CHLORIDE <NA> <NA> standardize
## COLOR gaussian identity standardize
## COND <NA> <NA> standardize
## DOC <NA> <NA> standardize
## MAGNESIUM <NA> <NA> standardize
## NITRATE_N gaussian identity standardize
## NITRITE_N binomial logit squeeze
## NITRITE_N:is_zero binomial logit <NA>
## NTL <NA> <NA> standardize
## PH <NA> <NA> standardize
## POTASSIUM gaussian identity standardize
## PTL <NA> <NA> standardize
## SILICA <NA> <NA> standardize
## SODIUM gaussian identity standardize
## SULFATE <NA> <NA> standardize
## TURB gaussian identity standardize
## CHLX gaussian identity standardize
## CHLL gaussian identity standardize
## OXYGEN gaussian identity standardize
## TEMPERATURE gaussian identity standardize
## DEPTH gaussian identity standardize
## CONDUCTIVITY gaussian identity standardize
## SECCHI gaussian identity standardize
## AREA_HA gaussian identity standardize
## ELEVATION gaussian identity standardize
## ABUNDANCE <NA> <NA> standardize
## PERIM_KM gaussian identity standardize
## INDEX_LAT_DD <NA> <NA> standardize
## INDEX_LON_DD <NA> <NA> standardize
## MMI_BENT_NLA12 gaussian identity standardize
## MMI_ZOOP_NLA6 gaussian identity standardize
## AMITOTAL gaussian identity standardize
## LitCvrQc3OE gaussian identity standardize
## LitRipCvrQc3OE gaussian identity standardize
## RDis_IX binomial logit squeeze
## RDis_IX:is_zero binomial logit <NA>
## RVegQc3OE gaussian identity standardize
## MICX gaussian identity standardize
## ATRAZINE gaussian identity standardize
## METHYLMERCURY gaussian identity standardize
## TOTALHG gaussian identity standardize
## URBAN binomial logit <NA>
## LAKE_ORIGIN binomial logit <NA>
image(mdf)
The observed missingness pattern is not random; rather, it reflects the fact that the data come from different source data frames. In particular, data points related to the physical characteristics of the lakes, such as elevation, latitude, and longitude, should not be imputed, so we can drop those missing rows.
Now we can start the imputation process for the remaining features.
select_lakes2 <- subset(select_lakes, !is.na(ELEVATION))
lakes_summ <- summaries(select_lakes2[,2:44])
mdf2 <- missing_data.frame(select_lakes2[,-1])
## NOTE: The following pairs of variables appear to have the same missingness pattern.
## Please verify whether they are in fact logically distinct variables.
## [,1] [,2]
## [1,] "ALUMINUM" "NITRITE_N"
## [2,] "LitRipCvrQc3OE" "RVegQc3OE"
show(mdf2)
## Object of class missing_data.frame with 1130 observations on 45 variables
##
## There are 64 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
## type missing method model
## ALUMINUM continuous 158 ppd linear
## AMMONIA_N continuous 1 ppd linear
## ANC continuous 0 <NA> <NA>
## CALCIUM continuous 0 <NA> <NA>
## CHLORIDE continuous 0 <NA> <NA>
## COLOR continuous 1 ppd linear
## COND continuous 0 <NA> <NA>
## DOC continuous 0 <NA> <NA>
## MAGNESIUM continuous 0 <NA> <NA>
## NITRATE_N continuous 50 ppd linear
## NITRITE_N SC_proportion 158 ppd betareg
## NITRITE_N:is_zero binary 158 ppd logit
## NTL continuous 0 <NA> <NA>
## PH continuous 0 <NA> <NA>
## POTASSIUM continuous 2 ppd linear
## PTL continuous 0 <NA> <NA>
## SILICA continuous 0 <NA> <NA>
## SODIUM continuous 1 ppd linear
## SULFATE continuous 0 <NA> <NA>
## TURB continuous 39 ppd linear
## CHLX continuous 5 ppd linear
## CHLL continuous 9 ppd linear
## OXYGEN continuous 155 ppd linear
## TEMPERATURE continuous 155 ppd linear
## DEPTH continuous 155 ppd linear
## CONDUCTIVITY continuous 155 ppd linear
## SECCHI continuous 67 ppd linear
## AREA_HA continuous 0 <NA> <NA>
## ELEVATION continuous 0 <NA> <NA>
## ABUNDANCE continuous 0 <NA> <NA>
## PERIM_KM continuous 0 <NA> <NA>
## INDEX_LAT_DD continuous 0 <NA> <NA>
## INDEX_LON_DD continuous 0 <NA> <NA>
## MMI_BENT_NLA12 continuous 208 ppd linear
## MMI_ZOOP_NLA6 continuous 95 ppd linear
## AMITOTAL continuous 114 ppd linear
## LitCvrQc3OE continuous 113 ppd linear
## LitRipCvrQc3OE continuous 114 ppd linear
## RDis_IX SC_proportion 120 ppd betareg
## RDis_IX:is_zero binary 120 ppd logit
## RVegQc3OE continuous 114 ppd linear
## MICX continuous 92 ppd linear
## ATRAZINE continuous 94 ppd linear
## METHYLMERCURY continuous 130 ppd linear
## TOTALHG continuous 135 ppd linear
## URBAN binary 0 <NA> <NA>
## LAKE_ORIGIN binary 100 ppd logit
##
## family link transformation
## ALUMINUM gaussian identity standardize
## AMMONIA_N gaussian identity standardize
## ANC <NA> <NA> standardize
## CALCIUM <NA> <NA> standardize
## CHLORIDE <NA> <NA> standardize
## COLOR gaussian identity standardize
## COND <NA> <NA> standardize
## DOC <NA> <NA> standardize
## MAGNESIUM <NA> <NA> standardize
## NITRATE_N gaussian identity standardize
## NITRITE_N binomial logit squeeze
## NITRITE_N:is_zero binomial logit <NA>
## NTL <NA> <NA> standardize
## PH <NA> <NA> standardize
## POTASSIUM gaussian identity standardize
## PTL <NA> <NA> standardize
## SILICA <NA> <NA> standardize
## SODIUM gaussian identity standardize
## SULFATE <NA> <NA> standardize
## TURB gaussian identity standardize
## CHLX gaussian identity standardize
## CHLL gaussian identity standardize
## OXYGEN gaussian identity standardize
## TEMPERATURE gaussian identity standardize
## DEPTH gaussian identity standardize
## CONDUCTIVITY gaussian identity standardize
## SECCHI gaussian identity standardize
## AREA_HA <NA> <NA> standardize
## ELEVATION <NA> <NA> standardize
## ABUNDANCE <NA> <NA> standardize
## PERIM_KM <NA> <NA> standardize
## INDEX_LAT_DD <NA> <NA> standardize
## INDEX_LON_DD <NA> <NA> standardize
## MMI_BENT_NLA12 gaussian identity standardize
## MMI_ZOOP_NLA6 gaussian identity standardize
## AMITOTAL gaussian identity standardize
## LitCvrQc3OE gaussian identity standardize
## LitRipCvrQc3OE gaussian identity standardize
## RDis_IX binomial logit squeeze
## RDis_IX:is_zero binomial logit <NA>
## RVegQc3OE gaussian identity standardize
## MICX gaussian identity standardize
## ATRAZINE gaussian identity standardize
## METHYLMERCURY gaussian identity standardize
## TOTALHG gaussian identity standardize
## URBAN <NA> <NA> <NA>
## LAKE_ORIGIN binomial logit <NA>
image(mdf2)
library(parallel)
detectCores()
## [1] 8
options(mc.cores = 2)
mdf2<-change(mdf2, y=c("ALUMINUM","AMMONIA_N","COLOR","NITRATE_N","NITRITE_N","POTASSIUM","SODIUM","TURB","CHLX","CHLL", "OXYGEN", "TEMPERATURE", "DEPTH","CONDUCTIVITY","SECCHI","MMI_BENT_NLA12", "MMI_ZOOP_NLA6","AMITOTAL", "LitCvrQc3OE", "LitRipCvrQc3OE", "RDis_IX","RVegQc3OE","MICX", "ATRAZINE", "METHYLMERCURY", "TOTALHG", "LAKE_ORIGIN" ), what = "imputation_method", to="pmm")
#imputations <- mi(mdf2, n.iter = 10, n.chains = 3, max.minutes = 10)
#show(imputations)
## The above code gave an error
# Let's try using the mice package instead
library(mice)
imputed_Data <- mice(select_lakes2[,-1], m=3, maxit = 30, method = 'pmm', seed = 500, printFlag = F)
summary(imputed_Data)
## Class: mids
## Number of multiple imputations: 3
## Imputation methods:
## ALUMINUM AMMONIA_N ANC CALCIUM CHLORIDE
## "pmm" "pmm" "" "" ""
## COLOR COND DOC MAGNESIUM NITRATE_N
## "pmm" "" "" "" "pmm"
## NITRITE_N NTL PH POTASSIUM PTL
## "pmm" "" "" "pmm" ""
## SILICA SODIUM SULFATE TURB CHLX
## "" "pmm" "" "pmm" "pmm"
## CHLL OXYGEN TEMPERATURE DEPTH CONDUCTIVITY
## "pmm" "pmm" "pmm" "pmm" "pmm"
## SECCHI AREA_HA ELEVATION ABUNDANCE PERIM_KM
## "pmm" "" "" "" ""
## INDEX_LAT_DD INDEX_LON_DD MMI_BENT_NLA12 MMI_ZOOP_NLA6 AMITOTAL
## "" "" "pmm" "pmm" "pmm"
## LitCvrQc3OE LitRipCvrQc3OE RDis_IX RVegQc3OE MICX
## "pmm" "pmm" "pmm" "pmm" "pmm"
## ATRAZINE METHYLMERCURY TOTALHG URBAN LAKE_ORIGIN
## "pmm" "pmm" "pmm" "" "pmm"
## PredictorMatrix:
## ALUMINUM AMMONIA_N ANC CALCIUM CHLORIDE COLOR COND DOC MAGNESIUM
## ALUMINUM 0 1 1 1 1 1 1 1 1
## AMMONIA_N 1 0 1 1 1 1 1 1 1
## ANC 1 1 0 1 1 1 1 1 1
## CALCIUM 1 1 1 0 1 1 1 1 1
## CHLORIDE 1 1 1 1 0 1 1 1 1
## COLOR 1 1 1 1 1 0 1 1 1
## NITRATE_N NITRITE_N NTL PH POTASSIUM PTL SILICA SODIUM SULFATE
## ALUMINUM 1 1 1 1 1 1 1 1 1
## AMMONIA_N 1 1 1 1 1 1 1 1 1
## ANC 1 1 1 1 1 1 1 1 1
## CALCIUM 1 1 1 1 1 1 1 1 1
## CHLORIDE 1 1 1 1 1 1 1 1 1
## COLOR 1 1 1 1 1 1 1 1 1
## TURB CHLX CHLL OXYGEN TEMPERATURE DEPTH CONDUCTIVITY SECCHI
## ALUMINUM 1 1 1 1 1 1 1 1
## AMMONIA_N 1 1 1 1 1 1 1 1
## ANC 1 1 1 1 1 1 1 1
## CALCIUM 1 1 1 1 1 1 1 1
## CHLORIDE 1 1 1 1 1 1 1 1
## COLOR 1 1 1 1 1 1 1 1
## AREA_HA ELEVATION ABUNDANCE PERIM_KM INDEX_LAT_DD INDEX_LON_DD
## ALUMINUM 1 1 1 1 1 1
## AMMONIA_N 1 1 1 1 1 1
## ANC 1 1 0 1 1 1
## CALCIUM 1 1 1 1 1 1
## CHLORIDE 1 1 1 1 1 1
## COLOR 1 1 1 1 1 1
## MMI_BENT_NLA12 MMI_ZOOP_NLA6 AMITOTAL LitCvrQc3OE LitRipCvrQc3OE
## ALUMINUM 1 1 1 1 1
## AMMONIA_N 1 1 1 1 1
## ANC 1 1 1 1 1
## CALCIUM 1 1 1 1 1
## CHLORIDE 1 1 1 1 1
## COLOR 1 1 1 1 1
## RDis_IX RVegQc3OE MICX ATRAZINE METHYLMERCURY TOTALHG URBAN
## ALUMINUM 1 1 1 1 1 1 1
## AMMONIA_N 1 1 1 1 1 1 1
## ANC 1 1 1 1 1 1 1
## CALCIUM 1 1 1 1 1 1 1
## CHLORIDE 1 1 1 1 1 1 1
## COLOR 1 1 1 1 1 1 1
## LAKE_ORIGIN
## ALUMINUM 1
## AMMONIA_N 1
## ANC 1
## CALCIUM 1
## CHLORIDE 1
## COLOR 1
## Number of logged events: 2340
## it im dep meth out
## 1 1 1 ALUMINUM pmm SODIUM
## 2 1 1 AMMONIA_N pmm SODIUM
## 3 1 1 COLOR pmm SODIUM
## 4 1 1 NITRATE_N pmm SODIUM
## 5 1 1 NITRITE_N pmm SODIUM
## 6 1 1 POTASSIUM pmm SULFATE
densityplot(imputed_Data)
#plot(imputed_Data)
imp1 <- mice::complete(imputed_Data,1)
imp_mis <- missing_data.frame(imp1)
image(imp_mis)
imp1_sum <- summaries(imp1[,1:43])
diff1 <- imp1_sum - lakes_summ
imp2 <- mice::complete(imputed_Data,2)
imp2_sum <- summaries(imp2[,1:43])
diff2 <- imp2_sum - lakes_summ
imp3 <- mice::complete(imputed_Data,3)
imp3_sum <- summaries(imp3[,1:43])
diff3 <- imp3_sum - lakes_summ
Comparing the three imputed results, they are quite similar to each other. For simplicity, let's continue with the average of the three completed data frames. Alternatively, we could pool the final result by combining all the imputed data frames.
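As a sketch of the pooling alternative (it assumes the `imputed_Data` mids object created above with `mice`; the regression formula is purely illustrative), the statistically proper approach is to fit the analysis model on each completed data set and combine the estimates with Rubin's rules via `mice::pool()`:

```r
# Pooling sketch with mice (assumes the `imputed_Data` mids object from above);
# the model formula below is illustrative only.
library(mice)
fit <- with(imputed_Data, lm(CHLX ~ NTL + PTL + TURB))  # fit on each imputed set
pooled <- pool(fit)                                     # combine via Rubin's rules
summary(pooled)
```

Pooling preserves the between-imputation uncertainty that simple averaging of the completed data frames discards.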
Next, I will repeat the above process for the 2007 data using the mice and mi packages. To simplify, I use only selected features out of the 120 available, some of which are duplicated.
# Select relevant features
vars.07 <- c("SITE_ID", "PH_FIELD", "PH_LAB", "COND", "ANC", "TURB", "TOC", "DOC", "NH4N_PPM", "NO3_NO2", "NTL_PPM", "PTL","CL_PPM", "NO3N_PPM", "SO4_PPM", "CA_PPM","MG_PPM" , "NA_PPM", "K_PPM", "COLOR", "SIO2", "NH4ION", "CHLA", "SECMEAN", "LAT_DD", "LON_DD", "DO2_2M", "AREA_HA", "LAKEPERIM" ,"SLD", "DEPTH_X", "DEPTHMAX", "ELEV_PT", "RDis_IX", "LitCvrQ", "LitRipCvr_OE", "RVeg_OE", "NTL","URBAN", "LAKE_ORIGIN", "REF_CLUSTER")
select.07.data <- select(lakes.07.df, vars.07)
mdf2007 <- missing_data.frame(select.07.data[,-1])
## Warning in .local(.Object, ...): PH_FIELD : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): CHLA : some observations are NaN, changing
## to NA
## Warning in .local(.Object, ...): SECMEAN : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): DO2_2M : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): DEPTH_X : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): DEPTHMAX : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): RDis_IX : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): LitCvrQ : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): LitRipCvr_OE : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): RVeg_OE : some observations are NaN,
## changing to NA
## NOTE: The following pairs of variables appear to have the same missingness pattern.
## Please verify whether they are in fact logically distinct variables.
## [,1] [,2]
## [1,] "URBAN" "LAKE_ORIGIN"
## [2,] "URBAN" "REF_CLUSTER"
## [3,] "LAKE_ORIGIN" "REF_CLUSTER"
show(mdf2007)
## Object of class missing_data.frame with 1157 observations on 40 variables
##
## There are 23 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
## type missing method model
## PH_FIELD continuous 9 ppd linear
## PH_LAB continuous 0 <NA> <NA>
## COND continuous 0 <NA> <NA>
## ANC continuous 0 <NA> <NA>
## TURB continuous 0 <NA> <NA>
## TOC continuous 0 <NA> <NA>
## DOC continuous 0 <NA> <NA>
## NH4N_PPM continuous 0 <NA> <NA>
## NO3_NO2 continuous 0 <NA> <NA>
## NTL_PPM continuous 0 <NA> <NA>
## PTL continuous 0 <NA> <NA>
## CL_PPM continuous 0 <NA> <NA>
## NO3N_PPM continuous 0 <NA> <NA>
## SO4_PPM continuous 0 <NA> <NA>
## CA_PPM continuous 0 <NA> <NA>
## MG_PPM continuous 0 <NA> <NA>
## NA_PPM continuous 0 <NA> <NA>
## K_PPM continuous 0 <NA> <NA>
## COLOR continuous 0 <NA> <NA>
## SIO2 continuous 0 <NA> <NA>
## NH4ION continuous 0 <NA> <NA>
## CHLA continuous 5 ppd linear
## SECMEAN continuous 65 ppd linear
## LAT_DD continuous 0 <NA> <NA>
## LON_DD continuous 0 <NA> <NA>
## DO2_2M continuous 55 ppd linear
## AREA_HA continuous 0 <NA> <NA>
## LAKEPERIM continuous 0 <NA> <NA>
## SLD continuous 0 <NA> <NA>
## DEPTH_X continuous 17 ppd linear
## DEPTHMAX continuous 1 ppd linear
## ELEV_PT continuous 0 <NA> <NA>
## RDis_IX SC_proportion 57 ppd betareg
## RDis_IX:is_zero binary 57 ppd logit
## LitCvrQ continuous 60 ppd linear
## LitRipCvr_OE continuous 63 ppd linear
## RVeg_OE continuous 60 ppd linear
## NTL continuous 0 <NA> <NA>
## URBAN binary 690 ppd logit
## LAKE_ORIGIN binary 690 ppd logit
## REF_CLUSTER unordered-categorical 690 ppd mlogit
##
## family link transformation
## PH_FIELD gaussian identity standardize
## PH_LAB <NA> <NA> standardize
## COND <NA> <NA> standardize
## ANC <NA> <NA> standardize
## TURB <NA> <NA> standardize
## TOC <NA> <NA> standardize
## DOC <NA> <NA> standardize
## NH4N_PPM <NA> <NA> standardize
## NO3_NO2 <NA> <NA> standardize
## NTL_PPM <NA> <NA> standardize
## PTL <NA> <NA> standardize
## CL_PPM <NA> <NA> standardize
## NO3N_PPM <NA> <NA> standardize
## SO4_PPM <NA> <NA> standardize
## CA_PPM <NA> <NA> standardize
## MG_PPM <NA> <NA> standardize
## NA_PPM <NA> <NA> standardize
## K_PPM <NA> <NA> standardize
## COLOR <NA> <NA> standardize
## SIO2 <NA> <NA> standardize
## NH4ION <NA> <NA> standardize
## CHLA gaussian identity standardize
## SECMEAN gaussian identity standardize
## LAT_DD <NA> <NA> standardize
## LON_DD <NA> <NA> standardize
## DO2_2M gaussian identity standardize
## AREA_HA <NA> <NA> standardize
## LAKEPERIM <NA> <NA> standardize
## SLD <NA> <NA> standardize
## DEPTH_X gaussian identity standardize
## DEPTHMAX gaussian identity standardize
## ELEV_PT <NA> <NA> standardize
## RDis_IX binomial logit squeeze
## RDis_IX:is_zero binomial logit <NA>
## LitCvrQ gaussian identity standardize
## LitRipCvr_OE gaussian identity standardize
## RVeg_OE gaussian identity standardize
## NTL <NA> <NA> standardize
## URBAN binomial logit <NA>
## LAKE_ORIGIN binomial logit <NA>
## REF_CLUSTER multinomial logit <NA>
image(mdf2007)
Unlike the 2012 data, we have few missing values, except for the URBAN, LAKE_ORIGIN, and REF_CLUSTER features, which are not missing at random; they will not be included in the imputation process below.
#library(mice)
imputed_Data.07 <- mice(select.07.data[,2:38], m=3, maxit = 30, method = 'pmm', seed = 500, printFlag = F)
## Warning: Number of logged events: 2
lakes07_summ <- summaries(select.07.data[,2:38])
summary(imputed_Data.07)
## Class: mids
## Number of multiple imputations: 3
## Imputation methods:
## PH_FIELD PH_LAB COND ANC TURB
## "pmm" "" "" "" ""
## TOC DOC NH4N_PPM NO3_NO2 NTL_PPM
## "" "" "" "" ""
## PTL CL_PPM NO3N_PPM SO4_PPM CA_PPM
## "" "" "" "" ""
## MG_PPM NA_PPM K_PPM COLOR SIO2
## "" "" "" "" ""
## NH4ION CHLA SECMEAN LAT_DD LON_DD
## "" "pmm" "pmm" "" ""
## DO2_2M AREA_HA LAKEPERIM SLD DEPTH_X
## "pmm" "" "" "" ""
## DEPTHMAX ELEV_PT RDis_IX LitCvrQ LitRipCvr_OE
## "pmm" "" "pmm" "pmm" "pmm"
## RVeg_OE NTL
## "pmm" ""
## PredictorMatrix:
## PH_FIELD PH_LAB COND ANC TURB TOC DOC NH4N_PPM NO3_NO2 NTL_PPM
## PH_FIELD 0 1 1 1 1 1 1 1 1 1
## PH_LAB 1 0 1 1 1 1 1 1 1 1
## COND 1 1 0 1 1 1 1 1 1 1
## ANC 1 1 1 0 1 1 1 1 1 1
## TURB 1 1 1 1 0 1 1 1 1 1
## TOC 1 1 1 1 1 0 1 1 1 1
## PTL CL_PPM NO3N_PPM SO4_PPM CA_PPM MG_PPM NA_PPM K_PPM COLOR SIO2
## PH_FIELD 1 1 1 1 1 1 1 1 1 1
## PH_LAB 1 1 1 1 1 1 1 1 1 1
## COND 1 1 1 1 1 1 1 1 1 1
## ANC 1 1 1 1 1 1 1 1 1 1
## TURB 1 1 1 1 1 1 1 1 1 1
## TOC 1 1 1 1 1 1 1 1 1 1
## NH4ION CHLA SECMEAN LAT_DD LON_DD DO2_2M AREA_HA LAKEPERIM SLD
## PH_FIELD 1 1 1 1 1 1 1 1 1
## PH_LAB 1 1 1 1 1 1 1 1 1
## COND 1 1 1 1 1 1 1 1 1
## ANC 1 1 1 1 1 1 1 1 1
## TURB 1 1 1 1 1 1 1 1 1
## TOC 1 1 1 1 1 1 1 1 1
## DEPTH_X DEPTHMAX ELEV_PT RDis_IX LitCvrQ LitRipCvr_OE RVeg_OE NTL
## PH_FIELD 0 1 1 1 1 1 1 0
## PH_LAB 0 1 1 1 1 1 1 0
## COND 0 1 1 1 1 1 1 0
## ANC 0 1 1 1 1 1 1 0
## TURB 0 1 1 1 1 1 1 0
## TOC 0 1 1 1 1 1 1 0
## Number of logged events: 2
## it im dep meth out
## 1 0 0 collinear NTL
## 2 0 0 collinear DEPTH_X
#densityplot(imputed_Data.07)
Let us extract the imputed data frames, plot them, and compare their summary statistics with the original data frame.
#plot(imputed_Data)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:mice':
##
## complete
## The following object is masked from 'package:mi':
##
## complete
## The following object is masked from 'package:Matrix':
##
## expand
## The following object is masked from 'package:reshape2':
##
## smiths
library(ggplot2)
imp.07.1 <- mice::complete(imputed_Data.07,1)
imp_mis <- missing_data.frame(imp.07.1)
image(imp_mis)
imp.07.1_sum <- summaries(imp.07.1)
# diff1.7 <- imp.07.1_sum - lakes07_summ
imp.07.2 <- mice::complete(imputed_Data.07,2)
imp.07.2_sum <- summaries(imp.07.2)
diff2.07 <- imp.07.2_sum - lakes07_summ
imp.07.3 <- mice::complete(imputed_Data.07,3)
imp.07.3_sum <- summaries(imp.07.3)
diff3.07 <- imp.07.3_sum - lakes07_summ
par(mfrow=c(7,6), mai=c(.3,.3,.35,.1), omi=c(.2,.5,.5,.1) )
for (i in 1:37) {
plot(density(select.07.data[,2:38][,i], na.rm = T), main="", cex=1.2, xlab="")
mtext(names(imp.07.1)[i], bg="gray", line = 1)
lines(density(imp.07.1[,i], na.rm = T), col="tomato", lwd=1)
lines(density(imp.07.2[,i], na.rm = T), col="green", lwd=1)
lines(density(imp.07.3[,i], na.rm = T), col="blue", lwd=1)
}
mtext("Density plots of the original and the imputed values for 2007 data",
side=3, outer=T, line=1, cex=1.5, col = "darkred")
mtext("Density", side=2, outer=TRUE, line=1, cex=1.2, col="darkred")
The above density plots indicate that the imputed data sets closely match each other and the original data; thus, I will take the average of the three imputed results for downstream processing.
# Let's take the average of the three results
lakes07_complete <- (imp.07.1 + imp.07.2 + imp.07.3)/3.0
For the remaining sections I will focus mainly on the 2012 data; the 2007 data can either augment it for prediction or serve as a test set for the models.
library(reshape2)
# Let's take the average of the three results
lakes_complete <- (imp1 + imp2 + imp3)/3.0
lakes_complete$URBAN <- imp1$URBAN
lakes_complete$LAKE_ORIGIN <- imp1$LAKE_ORIGIN
final_lakes <- cbind(UID=select_lakes2$UID, lakes_complete)
#write.csv(final_lakes, "final.raw.12.csv")
final_summ <- summaries(lakes_complete[,1:43])
final_summ %>% kable()%>% kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12)
|  | min | max | sd | mean | NAs |
|---|---|---|---|---|---|
| ALUMINUM | 0.000 | 2.480 | 0.086 | 0.021 | 0 |
| AMMONIA_N | 0.000 | 3.180 | 0.133 | 0.037 | 0 |
| ANC | -3361.400 | 203857.000 | 10555.249 | 2781.494 | 0 |
| CALCIUM | 0.121 | 594.900 | 45.052 | 28.373 | 0 |
| CHLORIDE | 0.036 | 18012.742 | 550.409 | 50.257 | 0 |
| COLOR | 0.000 | 840.000 | 41.838 | 25.782 | 0 |
| COND | 2.820 | 64810.000 | 3233.833 | 694.122 | 0 |
| DOC | 0.230 | 515.810 | 19.272 | 8.476 | 0 |
| MAGNESIUM | 0.067 | 1023.000 | 66.561 | 24.568 | 0 |
| NITRATE_N | 0.000 | 51.660 | 1.575 | 0.105 | 0 |
| NITRITE_N | 0.000 | 0.419 | 0.013 | 0.001 | 0 |
| NTL | 0.014 | 54.000 | 2.579 | 1.142 | 0 |
| PH | 2.830 | 10.470 | 0.850 | 8.079 | 0 |
| POTASSIUM | 0.041 | 3376.000 | 127.025 | 12.351 | 0 |
| PTL | 4.000 | 3636.000 | 277.942 | 116.397 | 0 |
| SILICA | 0.022 | 935.000 | 30.861 | 10.969 | 0 |
| SODIUM | 0.093 | 29890.000 | 1134.019 | 112.550 | 0 |
| SULFATE | 0.026 | 47325.202 | 1741.119 | 200.493 | 0 |
| TURB | 0.010 | 403.600 | 24.646 | 9.200 | 0 |
| CHLX | 0.000 | 764.640 | 55.418 | 26.186 | 0 |
| CHLL | 0.000 | 584.640 | 58.786 | 26.981 | 0 |
| OXYGEN | 0.250 | 23.686 | 2.372 | 6.753 | 0 |
| TEMPERATURE | 6.659 | 109.100 | 6.673 | 21.563 | 0 |
| DEPTH | 0.000 | 25.000 | 3.524 | 3.190 | 0 |
| CONDUCTIVITY | 1.780 | 68863.636 | 3667.320 | 733.781 | 0 |
| SECCHI | 0.020 | 28.000 | 2.314 | 2.177 | 0 |
| AREA_HA | 1.033 | 167489.609 | 7501.797 | 861.475 | 0 |
| ELEVATION | -53.270 | 3594.970 | 758.063 | 646.201 | 0 |
| ABUNDANCE | 267.000 | 67860.000 | 5537.304 | 3683.193 | 0 |
| PERIM_KM | 0.394 | 3520.774 | 125.610 | 21.095 | 0 |
| INDEX_LAT_DD | 26.070 | 48.985 | 4.921 | 40.887 | 0 |
| INDEX_LON_DD | -124.230 | -67.194 | 14.773 | -95.437 | 0 |
| MMI_BENT_NLA12 | 0.000 | 86.100 | 15.269 | 43.006 | 0 |
| MMI_ZOOP_NLA6 | 0.050 | 94.650 | 17.405 | 55.777 | 0 |
| AMITOTAL | 0.000 | 1.795 | 0.337 | 0.352 | 0 |
| LitCvrQc3OE | 0.000 | 5.636 | 0.688 | 0.883 | 0 |
| LitRipCvrQc3OE | 0.000 | 4.696 | 0.453 | 0.736 | 0 |
| RDis_IX | 0.000 | 0.950 | 0.278 | 0.482 | 0 |
| RVegQc3OE | 0.000 | 4.532 | 0.497 | 0.704 | 0 |
| MICX | 0.000 | 66.690 | 3.543 | 0.675 | 0 |
| ATRAZINE | 0.000 | 9.700 | 0.484 | 0.125 | 0 |
| METHYLMERCURY | 0.080 | 19.000 | 1.541 | 1.249 | 0 |
| TOTALHG | 4.840 | 859.730 | 93.917 | 106.116 | 0 |
lakes_long <- melt(final_lakes[,1:44], id.var = "UID")
head(lakes_long)
## UID variable value
## 1 1000001 ALUMINUM 0.007
## 2 1000010 ALUMINUM 0.020
## 3 1000011 ALUMINUM 0.012
## 4 1000013 ALUMINUM 0.031
## 5 1000014 ALUMINUM 0.000
## 6 1000015 ALUMINUM 2.480
ggplot(lakes_long, aes(x=value))+
geom_histogram(aes(y=..density..), fill="chartreuse4", color="chartreuse4", bins = 20)+
geom_density(fill="deepskyblue", alpha=0.2)+
facet_wrap(~variable, ncol = 5, scales = "free")+
theme_bw()+guides(fill=F, color=F)
From the above plot we can see that most of the histograms are highly right-skewed, with most of the data at very low values and only a few high values. This is typical of such environmental systems: most lakes have a low baseline concentration, while a few lakes have high values that could indicate human influence or a characteristic of that specific lake. To bring the data closer to a normal distribution, we can log-transform the skewed features.
In addition, some of the features have values of zero; these could be true zeros or an artifact of the detection limit. The EPA reports a detection limit for each feature, and it varies from feature to feature. For the log transformation we want the features to be strictly positive, so we will add the smallest detection limit across all the features. As noted in the summary section, the elevation and ANC features have some negative values; to deal with this I will use a cube root transformation on them instead of the log transformation.
library(dplyr)
# Add small purturbation to the data to avoid the zeros in the data
small_c <- 0.002
final_lakes.c <- final_lakes[,2:44] + (final_lakes[,2:44] < 0.002)*small_c
# Check if this has a significant effect on the data
new_summ <- summaries(final_lakes.c)
sum_new <- new_summ - summaries(final_lakes[,2:44])
head(sum_new,10)
## min max sd mean NAs
## ALUMINUM 0.002 0 0 0.001 0
## AMMONIA_N 0.002 0 0 0.000 0
## ANC 0.002 0 0 0.000 0
## CALCIUM 0.000 0 0 0.000 0
## CHLORIDE 0.000 0 0 0.000 0
## COLOR 0.002 0 0 0.000 0
## COND 0.000 0 0 0.000 0
## DOC 0.000 0 0 0.000 0
## MAGNESIUM 0.000 0 0 0.000 0
## NITRATE_N 0.002 0 0 0.001 0
After inspecting the difference between the perturbed and the original data, there is only a slight change, which is within an acceptable range. Now we can start transforming the data using either log or cube root, whichever works for the specific feature. Looking at the data, two features, elevation and ANC (acid neutralizing capacity), have some negative values, so we cannot apply the log transformation directly: we either have to shift the data to be strictly positive or try a cube root transformation. First, I will try the cube root transformation on these features; if the result is not satisfactory, we can shift and log-transform instead.
# Copy the data frame
transform_lake.df <- final_lakes.c
# Cube root transform first
cube.r <- function(x){sign(x)*abs(x)^(1/3)}
transform_lake.df$ANC <- cube.r(transform_lake.df$ANC)
transform_lake.df$ELEVATION <- cube.r(transform_lake.df$ELEVATION)
Let us check their histograms before and after transformation.
par(mfrow=c(2,2))
hist(final_lakes.c$ELEVATION, breaks=50, main="Original", col="#18BF5C", freq=F, xlab="Elevation")
lines(density(final_lakes.c$ELEVATION), col="darkred", lwd=3); box()
hist(transform_lake.df$ELEVATION, main="Transformed", breaks=50, col="#18BF5C", freq=F, xlab="Elevation")
lines(density(transform_lake.df$ELEVATION), col="darkred", lwd=3); box()
hist(final_lakes.c$ANC, breaks=50, main="Original", col="#18BF5C", freq=F, xlab="ANC")
lines(density(final_lakes.c$ANC), col="darkred", lwd=3); box()
hist(transform_lake.df$ANC, main="Transformed", breaks=50, col="#18BF5C", freq=F, xlab="ANC")
lines(density(transform_lake.df$ANC), col="darkred", lwd=3); box()
As can be seen in the above histograms, elevation is transformed effectively to an approximately Gaussian distribution, while ANC might still need additional transformation; I will leave it as it is.
# Variables that don't need transformation
ok_vars <- c("ANC","PH", "OXYGEN", "ELEVATION","INDEX_LAT_DD", "INDEX_LON_DD", "MMI_BENT_NLA12", "MMI_ZOOP_NLA6", "RDis_IX", "URBAN", "LAKE_ORIGIN")
# Variables that need transformation
log_vars <- c("ALUMINUM","AMMONIA_N", "CALCIUM", "CHLORIDE", "COLOR", "COND", "DOC", "MAGNESIUM", "NITRATE_N","NITRITE_N","NTL", "POTASSIUM", "PTL","SILICA", "SODIUM", "SULFATE","TURB", "CHLX", "CHLL", "TEMPERATURE", "DEPTH", "CONDUCTIVITY", "SECCHI", "AREA_HA", "ABUNDANCE", "PERIM_KM", "AMITOTAL", "LitCvrQc3OE", "LitRipCvrQc3OE", "RVegQc3OE", "MICX","ATRAZINE","METHYLMERCURY", "TOTALHG")
#for (i in 1:length(log_vars)) {
# transform_lake.df[,log_vars[i]] <- log(transform_lake.df[,log_vars[i]])
#}
transform_lake.df[, log_vars] <- log(dplyr::select(transform_lake.df, log_vars))
transform_lake.df$UID <- final_lakes$UID
#trans_summ <- summaries(transform_lake.df)
ts_lake_long <- melt(transform_lake.df, id.var = "UID")
ggplot(ts_lake_long, aes(x=value))+
geom_histogram(aes(y=..density..), fill="#81D2FF", color="#307D00", bins = 10)+
geom_density(fill="#FFB05A", alpha=0.2)+
facet_wrap(~variable, ncol = 5, scales = "free")+
guides(fill=F, color=F) + theme_bw()
As can be seen, most of the features now have a unimodal, approximately normal distribution, with a few exceptions. For now, I will leave them as they are and proceed to the next section.
library(corrplot)
## corrplot 0.84 loaded
# Compute correlation table
lakes_cor <- cor(transform_lake.df[,-44])
#dev.new(width=10, height=10, unit='in')
clrs <- colorRampPalette(c("#B26205","#FFB05A", "white","#81D2FF", "#0980B2"))
corrplot(lakes_cor, type = "lower", tl.pos = "ld", mar = c(1,1,1,1),
tl.col="black", addCoef.col = "black", tl.cex = 0.9,
diag = F, method = "square",col=clrs(100), number.cex = 0.55)
Next, let us filter highly correlated features and visualize them.
# Filter those with high correlation
threshold <- 0.85
high_cor <- lakes_cor
diag(high_cor) <- 0
ok <- apply(abs(high_cor) >= threshold, 1, any)
high_cor <-lakes_cor[ok,ok]
corrplot(high_cor, type = "lower",tl.cex = 1.2, tl.pos = "ld", mar = c(1,1,1,1),
addCoef.col = "black", method = "square", number.cex = 1,
tl.col="black", col=clrs(100), diag = F)
Most of the high correlations are expected from what we know about water quality parameters. For example, CHLL is the littoral chlorophyll-a concentration and CHLX is the open-water chlorophyll-a concentration, and these two are expected to be highly correlated. Likewise, COND (the conductivity of the lake's water) is driven by the ions present in the water, such as calcium, chloride, and sodium, so we expect it to correlate highly with those features. A more detailed plot is shown below.
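The conductivity-ion relationship discussed above can also be checked numerically; a quick sketch (it assumes the `transform_lake.df` data frame from the transformation step, with the feature names as they appear in the correlation table):

```r
# Correlation of conductivity with the major ion features (log-transformed data)
ion_vars <- c("CALCIUM", "CHLORIDE", "SODIUM", "MAGNESIUM")
round(cor(transform_lake.df$COND, transform_lake.df[, ion_vars]), 2)
```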
library(psych)
high_cor_names <- colnames(high_cor)
high_cor_lakes <- subset(transform_lake.df,select = high_cor_names)
pairs.panels(high_cor_lakes,
main="Pairs plot for highly correlated features",
method = "pearson", # correlation method
hist.col = "#0980B2",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
We can also see the exact locations of the sampled lakes as shown below. We can zoom and hover over the points to see the total nitrogen, total phosphorus, and area of the lakes.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#final_lakes.ts <- transform_lake.df
# geo styling
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showland = TRUE,
showlakes = TRUE,
landcolor = toRGB("gray95"),
subunitcolor = toRGB("gray85"),
countrycolor = toRGB("gray85"),
countrywidth = 0.5,
subunitwidth = 0.5
)
p <- plot_geo(final_lakes, lat = ~INDEX_LAT_DD, lon = ~INDEX_LON_DD, color= ~CHLX) %>%
add_markers(
text = ~paste(paste("Total nitrogen:",round(NTL,2)), paste("Total phosphorus:",round(PTL,2)), paste("Area (HA):", round(AREA_HA,2)),sep = "<br />"),
symbol = I("square"), size = ~AREA_HA, hoverinfo = "text"
) %>%
colorbar(title = "Chlorophyll-a amount<br />in the lake") %>%
layout(
title = 'US lakes<br />Surveyed in 2012', geo = g
)
## Warning: `line.width` does not currently support multiple values.
p
I have developed an additional interactive visualization for the data using an R Shiny app, hosted online at https://zerihun.shinyapps.io/US_lakes_classification/. This app is useful for further visualization and exploration of feature relationships.
In this section, I will work on reducing the dimensionality of the data using principal component analysis (PCA). If we can reduce the dimension significantly, we can use the new principal components for visualization and for training downstream machine learning algorithms.
To do the principal component analysis I will use the transformed data; however, it is also possible to use the original data. Since the features have different measurement units and ranges, I will center and scale the data before performing PCA.
#lakes12 <- read.csv("lakes_data2012.csv", header = T)
lakes12 <- transform_lake.df
# The target variable is the lakes Chlorophyl concentration
chl <- lakes12$CHLL+lakes12$CHLX
lakes12$Chlorophyl <- chl
lakes12$URBAN <- final_lakes$URBAN
lakes12$LAKE_ORIGIN <- final_lakes$LAKE_ORIGIN
lakes12 <- select(lakes12, UID, everything(), -CHLL, -CHLX)
# Save the data frame to file for future use.
write.csv(lakes12, "lakes12.ts.csv")
# Let us use the numerical variables for principal component analysis
x <- lakes12[,2:42] # Deselect the response feature, URBAN, LAKE_ORIGIN and UID
pca1 <- prcomp(data.matrix(x), scale. = T, center = T)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.513 1.93628 1.59032 1.57528 1.3941 1.3293 1.23954
## Proportion of Variance 0.301 0.09144 0.06169 0.06052 0.0474 0.0431 0.03747
## Cumulative Proportion 0.301 0.39243 0.45412 0.51464 0.5620 0.6051 0.64261
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 1.11843 1.03857 0.95748 0.9279 0.9078 0.87297
## Proportion of Variance 0.03051 0.02631 0.02236 0.0210 0.0201 0.01859
## Cumulative Proportion 0.67312 0.69943 0.72179 0.7428 0.7629 0.78148
## PC14 PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.85203 0.84381 0.80187 0.77956 0.76675 0.7495
## Proportion of Variance 0.01771 0.01737 0.01568 0.01482 0.01434 0.0137
## Cumulative Proportion 0.79918 0.81655 0.83223 0.84705 0.86139 0.8751
## PC20 PC21 PC22 PC23 PC24 PC25
## Standard deviation 0.72856 0.69242 0.64569 0.63963 0.61449 0.59701
## Proportion of Variance 0.01295 0.01169 0.01017 0.00998 0.00921 0.00869
## Cumulative Proportion 0.88804 0.89973 0.90990 0.91988 0.92909 0.93778
## PC26 PC27 PC28 PC29 PC30 PC31
## Standard deviation 0.55883 0.54193 0.53672 0.49957 0.49157 0.46148
## Proportion of Variance 0.00762 0.00716 0.00703 0.00609 0.00589 0.00519
## Cumulative Proportion 0.94540 0.95256 0.95959 0.96568 0.97157 0.97677
## PC32 PC33 PC34 PC35 PC36 PC37
## Standard deviation 0.44109 0.40130 0.38495 0.34599 0.31190 0.27862
## Proportion of Variance 0.00475 0.00393 0.00361 0.00292 0.00237 0.00189
## Cumulative Proportion 0.98151 0.98544 0.98905 0.99197 0.99435 0.99624
## PC38 PC39 PC40 PC41
## Standard deviation 0.25157 0.21348 0.17156 0.12612
## Proportion of Variance 0.00154 0.00111 0.00072 0.00039
## Cumulative Proportion 0.99778 0.99889 0.99961 1.00000
#plot(pca1, type="line", main = "Scree plot")
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_eig(pca1, ncp = 20)
# Plot cumulative proportion of variance explained by the PCs
pc_var <- pca1$sdev^2
pct_var <- 100*pc_var/sum(pc_var)
plot(c(0, cumsum(pct_var)), xlab = "Principal Components",
main = "Cumulative proportion of variance explained",
ylab = "% cumulative variance",
type = "o", col="dark red",
ylim = c(0,100))
We see that there is no clear "elbow" where the contribution of the PCs levels off. To capture 100% of the variation we would have to use all 41 PCs, and even to capture 95% we need 27 of the 41 PCs (the cumulative proportion first exceeds 95% at PC27).
Since we have observed that most of the variables have low pairwise correlations, many PCs are required, as expected, to explain most of the variation in the data: about 15 PCs are needed to exceed 80% of the variance, which is still a fairly high dimension by itself.
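The PC counts quoted above can be computed directly from the `pct_var` vector defined for the cumulative-variance plot, rather than read off the printed summary:

```r
# Number of PCs needed to reach a given percentage of explained variance
# (uses the pct_var vector computed above).
n_pc <- function(threshold) which(cumsum(pct_var) >= threshold)[1]
n_pc(80)   # PCs needed for >= 80% of the variance
n_pc(95)   # PCs needed for >= 95% of the variance
```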
fviz_pca_var(pca1,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel =F, # Avoid text overlapping
select.var = list(cos2=10), # top10 according to cos val
col.var = "darkred"
)
From this plot we can see that the Secchi depth is negatively correlated with both principal components. This result makes sense because Secchi depth measures the transparency of the lake: a higher Secchi depth indicates a cleaner lake and hence a low chlorophyll-a concentration.
Is there any difference between man-made and naturally occurring lakes?
The next plot tries to answer whether there is a noticeable difference or clustering between man-made and natural lakes. As can be seen from the plot, there is no noticeable difference in water quality features between man-made and natural lakes.
fviz_pca_biplot(pca1,
title= "PCA biplot",
label ="var",
habillage=lakes12$LAKE_ORIGIN,
addEllipses=TRUE,
ellipse.level=0.95
)
I will start the regression process by training a multiple linear regression model with all predictor features included. This model can later be enhanced by feature selection, and it can also serve as a reference (baseline) for other, more complex models.
library(caret)
# Split the data to training and testing
lakes.df <- select(lakes12, -UID)
#lakes.df <- lakes.12[,2:45]
training <- caret::createDataPartition(lakes.df$URBAN, p=0.85, list = F)
lake_train<-lakes.df[training, ]
lake_test<-lakes.df[-training, ]
chl.lm <- lm(Chlorophyl ~ ., data=lake_train)
summary(chl.lm)
##
## Call:
## lm(formula = Chlorophyl ~ ., data = lake_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1951 -0.7788 0.0229 0.7965 5.5421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9434254 1.9336197 1.005 0.315127
## ALUMINUM -0.1222922 0.0439032 -2.785 0.005455 **
## AMMONIA_N -0.0117464 0.0609677 -0.193 0.847263
## ANC 0.0026027 0.0212423 0.123 0.902510
## CALCIUM 0.1776504 0.0965881 1.839 0.066200 .
## CHLORIDE 0.2479604 0.0605269 4.097 4.56e-05 ***
## COLOR 0.0165917 0.0551844 0.301 0.763742
## COND -0.6206814 0.2086478 -2.975 0.003009 **
## DOC -0.2250675 0.1279957 -1.758 0.079013 .
## MAGNESIUM 0.0321815 0.0822951 0.391 0.695851
## NITRATE_N -0.2202193 0.0389125 -5.659 2.03e-08 ***
## NITRITE_N 0.1512179 0.1766684 0.856 0.392253
## NTL 1.1256084 0.1334348 8.436 < 2e-16 ***
## PH 0.1009699 0.0970588 1.040 0.298476
## POTASSIUM 0.0138769 0.0670881 0.207 0.836176
## PTL 0.3461666 0.0775577 4.463 9.07e-06 ***
## SILICA -0.0947985 0.0426004 -2.225 0.026304 *
## SODIUM -0.1089295 0.0967305 -1.126 0.260412
## SULFATE -0.0371353 0.0411725 -0.902 0.367322
## TURB 0.4811859 0.0650241 7.400 3.07e-13 ***
## OXYGEN 0.0192284 0.0221010 0.870 0.384516
## TEMPERATURE -0.2116252 0.2506554 -0.844 0.398729
## DEPTH 0.0320690 0.0538751 0.595 0.551824
## CONDUCTIVITY 0.0215893 0.0576507 0.374 0.708131
## SECCHI -0.5203569 0.1021038 -5.096 4.20e-07 ***
## AREA_HA -0.3473154 0.0941156 -3.690 0.000237 ***
## ELEVATION -0.0743355 0.0227353 -3.270 0.001117 **
## ABUNDANCE 0.5650820 0.0637629 8.862 < 2e-16 ***
## PERIM_KM 0.4769289 0.1373432 3.473 0.000540 ***
## INDEX_LAT_DD -0.0055581 0.0134181 -0.414 0.678808
## INDEX_LON_DD 0.0003126 0.0045613 0.069 0.945379
## MMI_BENT_NLA12 -0.0095538 0.0036447 -2.621 0.008904 **
## MMI_ZOOP_NLA6 0.0029671 0.0035713 0.831 0.406297
## AMITOTAL -0.0612869 0.0310069 -1.977 0.048391 *
## LitCvrQc3OE 0.0057995 0.0887265 0.065 0.947899
## LitRipCvrQc3OE 0.0027109 0.1653508 0.016 0.986923
## RDis_IX -0.1800969 0.1902459 -0.947 0.344065
## RVegQc3OE 0.0470245 0.0829375 0.567 0.570862
## MICX 0.0273046 0.0246226 1.109 0.267752
## ATRAZINE 0.0071383 0.0243428 0.293 0.769405
## METHYLMERCURY 0.0142991 0.0607254 0.235 0.813895
## TOTALHG 0.0916517 0.0889921 1.030 0.303335
## URBANYes 0.0671688 0.1144934 0.587 0.557576
## LAKE_ORIGINNATURAL -0.1220843 0.1222409 -0.999 0.318194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.363 on 918 degrees of freedom
## Multiple R-squared: 0.7889, Adjusted R-squared: 0.779
## F-statistic: 79.78 on 43 and 918 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(chl.lm, which = 1:2)
Observations on the trained model
As the training summary shows, 15 features have a significant (p < 0.05) linear relationship with the output variable chlorophyll, but most of them (28 features) do not appear to have a linear relationship with chlorophyll concentration.
Both \(R^2\) and adjusted \(R^2\) have moderate values close to 0.8, which is reasonable given the complexity of the model.
The residuals plot looks reasonable across the middle range of chlorophyll values. However, the model deviates for low and high chlorophyll values. In addition, the Q-Q plot indicates outliers at both ends, but in the middle range, where most of the data lie, the residuals look normally distributed.
The model also has an AIC of 3371.01 and a BIC of 3590.12. We can use these values later for model selection comparisons.
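The information criteria quoted above can be reproduced directly from the fitted object (a minimal sketch, assuming `chl.lm` from the chunk above is in the workspace):

```r
# Information criteria for the full linear model
AIC(chl.lm)   # Akaike information criterion
BIC(chl.lm)   # Bayesian information criterion
```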
To gauge how well the linear model trained above performs, we will evaluate it on the test data. Later, we will use this result to compare with other models.
performance <- function(y.obs, y.pred){
model.rmse <- RMSE(y.obs, y.pred)
model.cor <- cor(y.obs, y.pred)
model.mae <- mean(abs(y.obs - y.pred))
model.r2 <- 1- sum((y.obs - y.pred)^2)/sum((y.obs - mean(y.obs))^2)
return(rbind(model.rmse, model.cor, model.mae, model.r2))
}
y.pred <- predict(chl.lm, lake_test )
# Models performance metrics
ml.perf <- performance(lake_test$Chlorophyl, y.pred)
ml.perf
## [,1]
## model.rmse 1.3691218
## model.cor 0.8915594
## model.mae 0.9881585
## model.r2 0.7944393
As expected, the model performance decreased slightly on the test data; however, we still get fair performance with good correlation between the observed and predicted values. The model gives a mean absolute error of 0.9881585, which corresponds to an error of \({e}^{chlorophyll}=\) 2.686283 ug/L in chlorophyll concentration. Given the range of chlorophyll concentration, from 0.4406993 to 1166, this error can be considered acceptable.
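The back-transformation quoted above (assuming `Chlorophyl` was natural-log transformed upstream, as the \(e^{chlorophyll}\) conversion implies) is simply:

```r
# MAE is reported on the (assumed) log scale; exponentiating gives the
# multiplicative error factor in the original chlorophyll units (ug/L)
mae.log <- ml.perf["model.mae", 1]
exp(mae.log)
```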
From the multiple linear regression above, we observed that only 15 features had a significant linear relationship with the response variable. Hence, we can narrow down the number of features and select only the important ones using the step function, searching in both directions from an intercept-only base model.
# Define a base model - intercept only
base.mod <- lm(Chlorophyl ~ 1, data=lake_train)
# Define the full model - including all predictors
all.mod <- lm(Chlorophyl ~ ., data=lake_train)
lakes_step <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = 'both', k=2, trace = F)
summary(lakes_step)
##
## Call:
## lm(formula = Chlorophyl ~ SECCHI + NTL + ABUNDANCE + TURB + ELEVATION +
## NITRATE_N + DOC + PTL + SILICA + MMI_BENT_NLA12 + COND +
## CHLORIDE + CALCIUM + ALUMINUM + AMITOTAL + RVegQc3OE + LAKE_ORIGIN +
## SODIUM, data = lake_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4683 -0.7674 0.0238 0.8267 5.7185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.391495 0.752253 1.850 0.064659 .
## SECCHI -0.501315 0.091961 -5.451 6.38e-08 ***
## NTL 1.256190 0.116888 10.747 < 2e-16 ***
## ABUNDANCE 0.568734 0.061356 9.269 < 2e-16 ***
## TURB 0.466093 0.062380 7.472 1.80e-13 ***
## ELEVATION -0.083296 0.018339 -4.542 6.29e-06 ***
## NITRATE_N -0.226141 0.032643 -6.928 7.92e-12 ***
## DOC -0.324669 0.109876 -2.955 0.003206 **
## PTL 0.338990 0.072154 4.698 3.02e-06 ***
## SILICA -0.094126 0.040224 -2.340 0.019489 *
## MMI_BENT_NLA12 -0.009974 0.003515 -2.838 0.004637 **
## COND -0.594700 0.135480 -4.390 1.26e-05 ***
## CHLORIDE 0.241610 0.053448 4.520 6.96e-06 ***
## CALCIUM 0.163125 0.080454 2.028 0.042887 *
## ALUMINUM -0.130094 0.039055 -3.331 0.000899 ***
## AMITOTAL -0.057223 0.028024 -2.042 0.041438 *
## RVegQc3OE 0.081985 0.038706 2.118 0.034424 *
## LAKE_ORIGINNATURAL -0.213220 0.099790 -2.137 0.032880 *
## SODIUM -0.136208 0.083607 -1.629 0.103617
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.364 on 943 degrees of freedom
## Multiple R-squared: 0.783, Adjusted R-squared: 0.7789
## F-statistic: 189.1 on 18 and 943 DF, p-value: < 2.2e-16
lakes_step
##
## Call:
## lm(formula = Chlorophyl ~ SECCHI + NTL + ABUNDANCE + TURB + ELEVATION +
## NITRATE_N + DOC + PTL + SILICA + MMI_BENT_NLA12 + COND +
## CHLORIDE + CALCIUM + ALUMINUM + AMITOTAL + RVegQc3OE + LAKE_ORIGIN +
## SODIUM, data = lake_train)
##
## Coefficients:
## (Intercept) SECCHI NTL
## 1.391495 -0.501315 1.256190
## ABUNDANCE TURB ELEVATION
## 0.568734 0.466093 -0.083296
## NITRATE_N DOC PTL
## -0.226141 -0.324669 0.338990
## SILICA MMI_BENT_NLA12 COND
## -0.094126 -0.009974 -0.594700
## CHLORIDE CALCIUM ALUMINUM
## 0.241610 0.163125 -0.130094
## AMITOTAL RVegQc3OE LAKE_ORIGINNATURAL
## -0.057223 0.081985 -0.213220
## SODIUM
## -0.136208
We can see that stepwise selection has retained 18 features/variables as relevant to the linear model, which is still fairly high-dimensional. Next, we can calculate the importance ranking of these features and visualize it.
#library(mlbench)
# estimate variable importance
predStepwise <- varImp(lakes_step, scale=FALSE)
#Change names
rownames(predStepwise)[11] <- "URBAN"
rownames(predStepwise)[14] <- "LAKE_ORIGIN"
par(mai=c(1.5,1,0.2,0.1))
bar.df <- predStepwise[order(predStepwise, decreasing = T),]
lm.bar <- barplot(as.matrix(bar.df),
main="Important features selected by stepwise linear selection",
ylab = "Feature importance",
beside = T, ylim=c(0,12),
names.arg = rownames(predStepwise)[order(predStepwise, decreasing = T)],
las=2,
col=clrs(21),
cex.names = 0.8)
Let us now evaluate the performance of the stepwise linear model on the test data.
lm.step.pred <- predict(lakes_step, lake_test)
# Models performance metrics
lm.step.perf <- performance(lake_test$Chlorophyl, lm.step.pred )
lm.step.perf
## [,1]
## model.rmse 1.3887621
## model.cor 0.8882759
## model.mae 0.9792680
## model.r2 0.7884994
perform.table <- data.frame(lm.all=ml.perf, lm.step=lm.step.perf)
Looking at the correlation, we get a reasonably good result, similar to the multiple linear regression that used all the features.
We can further check the model's performance using 10-fold cross-validation with the cv.lm function from the DAAG package, as shown below.
library(DAAG)
##
## Attaching package: 'DAAG'
## The following object is masked from 'package:psych':
##
## cities
lakes.cv.lm <- cv.lm(data=lake_train, lakes_step, m=10, printit=F, seed=30)
## Warning in cv.lm(data = lake_train, lakes_step, m = 10, printit = F, seed = 30):
##
## As there is >1 explanatory variable, cross-validation
## predicted values for a fold are not a linear function
## of corresponding overall predicted values. Lines that
## are shown for the different folds are approximate
lm.step.cv.perf <- performance(lakes.cv.lm$Chlorophyl, lakes.cv.lm$cvpred)
lm.step.cv.perf
## [,1]
## model.rmse 1.3947268
## model.cor 0.8766364
## model.mae 1.0333327
## model.r2 0.7684066
perform.table$lm.step.cv <- lm.step.cv.perf
With the 10-fold CV we get a comparable result from the stepwise linear model, with \(R^2\) now 0.7684066 and the correlation between the observed and CV-predicted values at about 0.88.
Therefore, we can conclude that multiple linear regression with the 18 selected features gives reasonable, acceptable results. However, the model performs well mainly when the chlorophyll concentration is in the middle range. Thus, other non-linear machine learning algorithms such as artificial neural networks (ANN), generalized additive models (GAM), etc. might need to be explored to get improved results.
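As a sketch of the non-linear direction suggested above, a generalized additive model could be fit with the `mgcv` package; the choice of predictors here (taken from the stepwise model) and the default thin-plate smooths are illustrative, not tuned:

```r
library(mgcv)
# GAM with smooth terms on a few of the strongest stepwise predictors
chl.gam <- gam(Chlorophyl ~ s(NTL) + s(PTL) + s(TURB) + s(SECCHI) + s(ABUNDANCE),
               data = lake_train)
# Evaluate on the held-out data with the same performance() helper as above
gam.pred <- predict(chl.gam, lake_test)
performance(lake_test$Chlorophyl, gam.pred)
```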
I have done feature selection in the previous section using the stepwise approach. In this section, I will further explore the two commonly used features selection methods, namely Boruta and LASSO.
library(Boruta)
## Loading required package: ranger
set.seed(123)
lakes.boru<-Boruta(Chlorophyl~., data=lake_train, doTrace=0)
print(lakes.boru)
## Boruta performed 99 iterations in 1.272344 mins.
## 33 attributes confirmed important: ABUNDANCE, AMITOTAL,
## AMMONIA_N, ANC, ATRAZINE and 28 more;
## 8 attributes confirmed unimportant: AREA_HA, LitCvrQc3OE,
## LitRipCvrQc3OE, NITRITE_N, PERIM_KM and 3 more;
## 2 tentative attributes left: ALUMINUM, LAKE_ORIGIN;
par(mai=c(1.8,0.9,0.5,0.2))
plot(lakes.boru, xlab="", xaxt="n")
lz<-lapply(1:ncol(lakes.boru$ImpHistory), function(i)
lakes.boru$ImpHistory[is.finite(lakes.boru$ImpHistory[, i]), i])
names(lz)<-colnames(lakes.boru$ImpHistory)
lb<-sort(sapply(lz, median))
axis(side=1, las=2, labels=names(lb), at=1:ncol(lakes.boru$ImpHistory), cex.axis=0.8, font = 1)
title("Feature importance as predicted by Boruta", col.main="darkgreen")
Boruta flags 33 of the 43 features as important, which does not reduce the feature set as effectively as the stepwise selection did. This might be due to Boruta's failure to narrow it down, or it could indicate that the problem is genuinely high-dimensional and most of the features are important.
getConfirmedFormula(lakes.boru)
## Chlorophyl ~ AMMONIA_N + ANC + CALCIUM + CHLORIDE + COLOR + COND +
## DOC + MAGNESIUM + NITRATE_N + NTL + PH + POTASSIUM + PTL +
## SILICA + SODIUM + SULFATE + TURB + OXYGEN + TEMPERATURE +
## DEPTH + CONDUCTIVITY + SECCHI + ELEVATION + ABUNDANCE + INDEX_LAT_DD +
## INDEX_LON_DD + MMI_BENT_NLA12 + MMI_ZOOP_NLA6 + AMITOTAL +
## MICX + ATRAZINE + METHYLMERCURY + TOTALHG
## <environment: 0x0000000015b44e20>
predBoruta <- getSelectedAttributes(lakes.boru, withTentative = F)
Let us also compare the results of Boruta with the stepwise feature selection.
intersect(predBoruta, rownames(predStepwise))
## [1] "CALCIUM" "CHLORIDE" "DOC" "NITRATE_N"
## [5] "NTL" "PTL" "SILICA" "SODIUM"
## [9] "TURB" "SECCHI" "ELEVATION" "ABUNDANCE"
## [13] "MMI_BENT_NLA12" "AMITOTAL"
Here we can see that there is agreement between stepwise linear selection and Boruta, with 14 features commonly selected as important.
Next, we will use LASSO regularization to do the feature selection and compare it with the previous two methods.
#### 10-fold cross validation ####
# LASSO
library("glmnet")
## Loading required package: foreach
## Loaded glmnet 2.0-16
library(doParallel)
## Loading required package: iterators
registerDoParallel(3)
set.seed(1234) # set seed
# (10-fold) cross validation for the LASSO
# Select the predictor features data only
xTrain <- data.matrix(select(lake_train,-Chlorophyl))
# Binarize the categorical features
urban <- ifelse(lake_train$URBAN == "Yes", 1,0)
lake_origin <- ifelse(lake_train$LAKE_ORIGIN == "NATURAL", 1,0)
# Replace the categorical data with binarized data
xTrain[,42:43] <- cbind(urban,lake_origin)
dim(xTrain)
## [1] 962 43
#Do the same on the test data
xTest <- data.matrix(select(lake_test,-Chlorophyl))
urban <- ifelse(lake_test$URBAN == "Yes", 1,0)
lake_origin <- ifelse(lake_test$LAKE_ORIGIN == "NATURAL", 1,0)
xTest[,42:43] <- cbind(urban,lake_origin)
dim(xTest)
## [1] 168 43
yTrain <- lake_train$Chlorophyl
yTest <- lake_test$Chlorophyl
cvLASSO <- cv.glmnet(data.matrix(xTrain), yTrain, alpha = 1, parallel=TRUE)
par(mai=c(1,1,1,0.1))
plot(cvLASSO)
mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)
# Report MSE LASSO
predLASSO <- predict(cvLASSO, s = cvLASSO$lambda.1se, newx = xTest)
LASSO.perf <- performance(yTest, predLASSO)
LASSO.perf
## 1
## model.rmse 1.4310867
## model.cor 0.8836414
## model.mae 1.0736569
## model.r2 0.7754114
perform.table$LASSO <- LASSO.perf
Again here, we see that using LASSO resulted in a similar performance compared to the stepwise linear regression.
library(kableExtra)
# lambda that gives best performance on misclassification error
bestlambda <- cvLASSO$lambda.min
# extract parameters at each of these lambda values
coefs.bestLambda <- as.matrix(coef(cvLASSO, s = bestlambda))
# which coefficients are non-zero
coef.nonzero <- (data.frame(variables=rownames(coefs.bestLambda)[which(coefs.bestLambda > 0)], coefs=coefs.bestLambda[ which( coefs.bestLambda > 0)]))
coef.nonzero %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12)
| variables | coefs |
|---|---|
| (Intercept) | 1.0539817 |
| CALCIUM | 0.1132904 |
| CHLORIDE | 0.1966729 |
| COLOR | 0.0037958 |
| NITRITE_N | 0.1419699 |
| NTL | 1.0873324 |
| PH | 0.1144994 |
| PTL | 0.3415499 |
| TURB | 0.4916500 |
| OXYGEN | 0.0054959 |
| DEPTH | 0.0216674 |
| ABUNDANCE | 0.5628738 |
| PERIM_KM | 0.3600837 |
| INDEX_LON_DD | 0.0003866 |
| MMI_ZOOP_NLA6 | 0.0022615 |
| LitRipCvrQc3OE | 0.0239774 |
| RVegQc3OE | 0.0364415 |
| MICX | 0.0249451 |
| ATRAZINE | 0.0080291 |
| METHYLMERCURY | 0.0120455 |
| TOTALHG | 0.0590776 |
| URBAN | 0.0846511 |
# Compare the features selected using LASSO and stepwise
lasso.vars <- as.character(coef.nonzero$variables[-1])
lasso.vars
## [1] "CALCIUM" "CHLORIDE" "COLOR" "NITRITE_N"
## [5] "NTL" "PH" "PTL" "TURB"
## [9] "OXYGEN" "DEPTH" "ABUNDANCE" "PERIM_KM"
## [13] "INDEX_LON_DD" "MMI_ZOOP_NLA6" "LitRipCvrQc3OE" "RVegQc3OE"
## [17] "MICX" "ATRAZINE" "METHYLMERCURY" "TOTALHG"
## [21] "URBAN"
intersect(rownames(predStepwise), lasso.vars)
## [1] "NTL" "ABUNDANCE" "TURB" "PTL" "URBAN" "CHLORIDE"
## [7] "CALCIUM" "RVegQc3OE"
# Compare the features selected using LASSO and boruta
intersect(predBoruta, lasso.vars)
## [1] "CALCIUM" "CHLORIDE" "COLOR" "NTL"
## [5] "PH" "PTL" "TURB" "OXYGEN"
## [9] "DEPTH" "ABUNDANCE" "INDEX_LON_DD" "MMI_ZOOP_NLA6"
## [13] "MICX" "ATRAZINE" "METHYLMERCURY" "TOTALHG"
# Compare the features selected by LASSO, boruta and stepwise
comm_selected_vars <- intersect(intersect(predBoruta, rownames(predStepwise)), lasso.vars)
comm_selected_vars
## [1] "CALCIUM" "CHLORIDE" "NTL" "PTL" "TURB" "ABUNDANCE"
all_selected_vars <- union(union(predBoruta, rownames(predStepwise)), lasso.vars)
all_selected_vars
## [1] "AMMONIA_N" "ANC" "CALCIUM"
## [4] "CHLORIDE" "COLOR" "COND"
## [7] "DOC" "MAGNESIUM" "NITRATE_N"
## [10] "NTL" "PH" "POTASSIUM"
## [13] "PTL" "SILICA" "SODIUM"
## [16] "SULFATE" "TURB" "OXYGEN"
## [19] "TEMPERATURE" "DEPTH" "CONDUCTIVITY"
## [22] "SECCHI" "ELEVATION" "ABUNDANCE"
## [25] "INDEX_LAT_DD" "INDEX_LON_DD" "MMI_BENT_NLA12"
## [28] "MMI_ZOOP_NLA6" "AMITOTAL" "MICX"
## [31] "ATRAZINE" "METHYLMERCURY" "TOTALHG"
## [34] "URBAN" "LAKE_ORIGIN" "RVegQc3OE"
## [37] "LAKE_ORIGINNATURAL" "NITRITE_N" "PERIM_KM"
## [40] "LitRipCvrQc3OE"
Feature selection using LASSO resulted in 21 features being selected as important, slightly more than the 18 retained by the stepwise selection above. However, only 8 features were selected by both stepwise and LASSO regression, and 16 features were selected as important by both LASSO and Boruta. Finally, only 6 features were selected in common by all three methods. This indicates substantial disagreement between the feature selection methods.
Next, let us assess the predictive power of a linear regression using only the 6 commonly selected features.
# Keep the six commonly selected features plus the response
select.vars <- c(comm_selected_vars, "Chlorophyl")
small.lm <- lm(as.formula(paste("Chlorophyl ~ ", paste(comm_selected_vars, collapse = " + "))),
data = lake_train[, select.vars])
summary(small.lm)
##
## Call:
## lm(formula = as.formula(paste("Chlorophyl ~ ", paste(comm_selected_vars,
## collapse = " + "))), data = lake_train[, select.vars])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5734 -0.8593 0.1054 0.9602 5.2000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.47623 0.60608 -2.436 0.015045 *
## CALCIUM -0.22033 0.04968 -4.435 1.03e-05 ***
## CHLORIDE 0.04758 0.03416 1.393 0.163962
## NTL 0.90841 0.08583 10.584 < 2e-16 ***
## PTL 0.25650 0.06589 3.893 0.000106 ***
## TURB 0.83515 0.05074 16.458 < 2e-16 ***
## ABUNDANCE 0.62133 0.06741 9.217 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.563 on 955 degrees of freedom
## Multiple R-squared: 0.7112, Adjusted R-squared: 0.7094
## F-statistic: 392 on 6 and 955 DF, p-value: < 2.2e-16
lm.small.pred <- predict(small.lm, lake_test[, select.vars])
# Calculate the RMSE
lm.small.perf <- performance(lake_test$Chlorophyl, lm.small.pred)
lm.small.perf
## [,1]
## model.rmse 1.5559103
## model.cor 0.8583315
## model.mae 1.1863575
## model.r2 0.7345242
perform.table$lm.small <- lm.small.perf
Comparing model performance so far
models.perf <- data.frame (t(perform.table))
models.perf$models <- rownames(models.perf)
perf.long <- melt(models.perf, id.vars = "models" )
theme_set(theme_bw())
ggplot(perf.long, aes(x=models, y=value)) +
geom_point(size=5) +
geom_segment(aes(x=models, xend=models, y=0, yend=value)) +
labs(title="Models RMSE Chart") + ylab("RMSE")+
theme(axis.text.x = element_text(size=14, angle = 60, hjust = 1),
axis.text.y = element_text(size=14),
axis.title=element_text(size=18,face="bold", color = "darkred"))+
facet_wrap(~ variable)
library(fmsb)
models.perf <- data.frame (t(perform.table))
names(models.perf) <- c("RMSE", "Corr Coef", "MAE","R2")
models.perf <- rbind(rep(2,4), rep(0,4), models.perf)
col_fill <- c(rgb(0.5,0.2,0.8,0.2), rgb(0.5,0.2,0.1,0.2), rgb(0.5,0.8,0.1,0.2),rgb(0.2,0.2,0.2,0.2))
radarchart(models.perf, pfcol=col_fill)
As we can see from the results above, all the regression models gave very similar results on all of the performance metrics chosen, evaluated on the test data. In the next sections, I will train an artificial neural network and see if we can improve the prediction beyond what we obtained using the stepwise linear regression.
In this section I will train an artificial neural network, using the keras package. For comparison, I will use the features selected by the stepwise linear selection.
Since the features have different measurement units and wide ranges of values, I will standardize the numeric features first.
library(reticulate)
library(keras)
## First using only the features selected by stepwise model
# Normalize training data
step_vars <- rownames(predStepwise)
ann.vars <-c("SECCHI", "NTL", "ABUNDANCE", "TURB", "ELEVATION", "PTL", "NITRATE_N", "DOC", "SILICA", "ALUMINUM", "URBAN", "AMITOTAL", "COND", "LAKE_ORIGIN", "CALCIUM", "PH", "RDis_IX", "MMI_ZOOP_NLA6")
#ann.train.df <- select(lake_train, ann.vars)
train_std <- scale(select(lake_train[, ann.vars], -URBAN, -LAKE_ORIGIN))
dim(train_std)
## [1] 962 16
col_means_train <- attr(train_std, "scaled:center")
col_stddevs_train <- attr(train_std, "scaled:scale")
# Replace back the factors we dont want to standardize them
train_std <- data.frame(train_std,
URBAN=ifelse(lake_train$URBAN == "Yes", 1,0),
LAKE_ORIGIN =ifelse(lake_train$LAKE_ORIGIN == "NATURAL", 1,0)
)
test_std <- scale(select(lake_test[, ann.vars], -URBAN, -LAKE_ORIGIN),
center = col_means_train, scale = col_stddevs_train)
dim(train_std)
## [1] 962 18
# Use means and standard deviations from training set to normalize test set
# Replace back the factors we dont want to standardize them
test_std <- data.frame(test_std,
URBAN=ifelse(lake_test$URBAN == "Yes", 1,0),
LAKE_ORIGIN =ifelse(lake_test$LAKE_ORIGIN == "NATURAL", 1,0)
)
yTrain <- lake_train$Chlorophyl
yTest <- lake_test$Chlorophyl
Next, I will build the ANN model with two hidden layers and one output node for the regression. To start, each hidden layer will have 64 nodes with the ReLU activation function. For the output layer, since this is a regression, I will use one node with a linear activation function.
trainX <- as.matrix(train_std) # Features selected by stepwise only)
build_model <- function() {
model <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu", input_shape = dim(trainX)[2]) %>%
layer_dense(units = 64,kernel_initializer = 'normal', activation = "relu") %>%
layer_dense(units = 1, kernel_initializer = 'normal', activation = "linear")
model %>% compile(
loss = "mse",
optimizer = optimizer_rmsprop(lr=0.0001),
metrics = list("mean_absolute_error")
)
model
}
model <- build_model()
model %>% summary()
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_1 (Dense) (None, 64) 1216
## ___________________________________________________________________________
## dense_2 (Dense) (None, 64) 4160
## ___________________________________________________________________________
## dense_3 (Dense) (None, 1) 65
## ===========================================================================
## Total params: 5,441
## Trainable params: 5,441
## Non-trainable params: 0
## ___________________________________________________________________________
Next, the model will be trained for 500 epochs.
# Display training progress by printing a single dot for each completed epoch.
print_dot_callback <- callback_lambda(
on_epoch_end = function(epoch, logs) {
if (epoch %% 80 == 0) cat("\n")
cat(".")
}
)
# The patience parameter is the amount of epochs to check for improvement.
early_stop <- callback_early_stopping(monitor = "val_loss", patience = 100)
epochs <- 500
set.seed(12)
# Fit the model and store training stats
history <- model %>% fit(
trainX,
yTrain,
epochs = epochs,
validation_split = 0.2,
verbose = 0,
callbacks = list(print_dot_callback)
)
##
## ................................................................................
## ................................................................................
## ................................................................................
## ................................................................................
## ................................................................................
## ................................................................................
## ....................
plot(history, metrics = "mean_absolute_error", smooth = FALSE) +
coord_cartesian(ylim = c(0,5))+ geom_point(aes(alpha=1))+ guides(alpha=F)
min(history$metrics$val_mean_absolute_error)
## [1] 1.041282
c(loss, mae) %<-% (model %>% evaluate(as.matrix(test_std), yTest, verbose = 0))
paste0("Mean absolute error on test set: ", sprintf("%.2f", mae))
## [1] "Mean absolute error on test set: 1.00"
ann_test_pred <- model %>% predict(as.matrix(test_std))
ann.perf <- performance(yTest, ann_test_pred)
ann.perf
## [,1]
## model.rmse 1.3320329
## model.cor 0.8977825
## model.mae 1.0034594
## model.r2 0.8054256
perform.table$ann.perf <- ann.perf
We again see that the performance of the ANN model on the test data is almost exactly the same as the multiple linear regression with all the features and the stepwise linear model. Hence, for the sake of simplicity and interpretability, in this case we should choose the linear model with stepwise selection. However, this doesn't mean the ANN model built here is the best we can get. There are a number of hyperparameters we can play with to see if we can improve the model's performance: the number of hidden layers and nodes, the activation functions, the learning rate, and the overall model structure.
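For example, a deeper variant with dropout regularization and a lower learning rate could be tried; the layer sizes and dropout rate below are illustrative starting points, not tuned values:

```r
# Alternative architecture to experiment with: one extra hidden layer
# plus dropout between layers to guard against overfitting
build_model2 <- function() {
  model <- keras_model_sequential() %>%
    layer_dense(units = 128, activation = "relu", input_shape = dim(trainX)[2]) %>%
    layer_dropout(rate = 0.2) %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dropout(rate = 0.2) %>%
    layer_dense(units = 32, activation = "relu") %>%
    layer_dense(units = 1, activation = "linear")
  model %>% compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(lr = 0.00005),  # slower learning rate
    metrics = list("mean_absolute_error")
  )
  model
}
```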
Another machine learning technique we could try is tree-based regression, which I will train next using the rpart package.
library(rpart)
set.seed(1234)
rpart_train <- data.frame(train_std)
rpart_train$Chlorophyl <- yTrain
lakes.rpart<-rpart(Chlorophyl~., data=rpart_train)
lakes.rpart
## n= 962
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 962 8080.29600 4.2904460
## 2) TURB< 0.07309393 536 1889.46500 2.4876570
## 4) SECCHI>=0.627837 275 661.46530 1.5363600
## 8) NTL< -0.9851275 111 206.58060 0.6644478 *
## 9) NTL>=-0.9851275 164 313.38440 2.1264960 *
## 5) SECCHI< 0.627837 261 716.91850 3.4899820
## 10) TURB< -0.8923215 20 76.32661 1.1272760 *
## 11) TURB>=-0.8923215 241 519.67880 3.6860580 *
## 3) TURB>=0.07309393 426 2256.96100 6.5587440
## 6) NTL< 0.6623992 230 778.34720 5.3506650
## 12) NTL< 0.03702325 100 292.80300 4.3990440 *
## 13) NTL>=0.03702325 130 325.32570 6.0826810
## 26) ABUNDANCE< -0.8379097 22 34.05670 4.2553210 *
## 27) ABUNDANCE>=-0.8379097 108 202.84090 6.4549210 *
## 7) NTL>=0.6623992 196 749.03460 7.9763880
## 14) TURB< 0.7057876 65 142.54000 6.5307150 *
## 15) TURB>=0.7057876 131 403.24100 8.6937070
## 30) NTL< 1.659964 95 184.65750 8.1754330 *
## 31) NTL>=1.659964 36 125.72720 10.0613700 *
library(rpart.plot)
rpart.plot(lakes.rpart, digits=3, main="Tree based regression result")
rpart.plot(lakes.rpart, digits = 4, fallen.leaves = T, type=3, extra=101,
main="Tree based regression result", cex=0.85)
The tree regression using rpart chose the TURB feature (turbidity of the lake) as the root, followed by NTL (total nitrogen in the lake), SECCHI (a measure of the lake's transparency) and ABUNDANCE (the abundance of phytoplankton in the lake). This result makes sense, since these parameters are directly related to algal growth and eutrophication of lakes. We can test the model's performance on the test data as follows.
lak.rp.pred<-predict(lakes.rpart, data.frame(test_std))
summary(lak.rp.pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6644 2.1265 3.6861 4.2457 6.4549 10.0614
rpart.perf <- performance(yTest, lak.rp.pred)
rpart.perf
## [,1]
## model.rmse 1.8360633
## model.cor 0.7953457
## model.mae 1.3843657
## model.r2 0.6303156
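The split variables chosen above can be cross-checked against rpart's internal importance scores, which it accumulates across all splits of the fitted `lakes.rpart` object:

```r
# Variable importance as accumulated by rpart across all splits
round(lakes.rpart$variable.importance, 2)
```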
Again we see that rpart was not able to yield a better result compared to the previous models. Next, I will train the xgboost algorithm, an acronym for extreme gradient boosting, a decision-tree-based algorithm.
Now, I will train an xgboost model, which is known for its high efficiency, flexibility and portability. Similar to the ANN, xgboost has a number of tuning parameters to improve model performance. For this work I will use commonly used values for regression. The R implementation of the xgboost algorithm is found in the 'xgboost' package.
library(xgboost)
dtrain <- xgb.DMatrix(data = xTrain, label = yTrain)
lake.xgb <- xgboost(data = dtrain, max_depth = 6, eta = 0.01, gamma = 0.01, nthread = 3, nfold=5,
nrounds = 1000,eval_metric = "rmse", objective = "reg:linear", verbose = 0)
plot(lake.xgb$evaluation_log, xlab='number of trials',
col="skyblue3", lty=3, type="p", main="RMSE of xgboost on the training data")
Now that we have a trained model, let us test its performance on the test data.
## Predictions
xbg.preds <- predict(lake.xgb, newdata = xTest)
xgb.perf <- performance(yTest, xbg.preds)
xgb.perf
## [,1]
## model.rmse 1.3799232
## model.cor 0.8896438
## model.mae 1.0032522
## model.r2 0.7911831
perform.table$xgb <- xgb.perf
The performance of xgboost was not significantly different from the other models. However, there is a possibility to improve the performance by further tuning the hyperparameters.
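One way to pursue that tuning is cross-validation with early stopping over a small grid; the sketch below varies only `max_depth` and `eta`, and the grid values are illustrative rather than recommendations:

```r
# Hedged sketch: cross-validated tuning of two xgboost hyperparameters,
# reusing the dtrain matrix built above
grid <- expand.grid(max_depth = c(4, 6, 8), eta = c(0.005, 0.01, 0.05))
cv.rmse <- apply(grid, 1, function(g) {
  cv <- xgb.cv(data = dtrain, nrounds = 1000, nfold = 5,
               max_depth = g["max_depth"], eta = g["eta"],
               objective = "reg:linear", eval_metric = "rmse",
               early_stopping_rounds = 50, verbose = 0)
  min(cv$evaluation_log$test_rmse_mean)  # best CV RMSE for this setting
})
grid[which.min(cv.rmse), ]  # hyperparameter combination with lowest CV RMSE
```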
The following table summarizes all the models tested so far. As can be seen, all of the models tried had very similar performance. Even if the performances are not excellent, I believe they can be considered acceptable.
perform.table %>% kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12)
| lm.all | lm.step | lm.step.cv | LASSO | lm.small | ann.perf | xgb | |
|---|---|---|---|---|---|---|---|
| model.rmse | 1.369 | 1.389 | 1.3947268 | 1.4310867 | 1.5559103 | 1.3320329 | 1.3799232 |
| model.cor | 0.892 | 0.888 | 0.8766364 | 0.8836414 | 0.8583315 | 0.8977825 | 0.8896438 |
| model.mae | 0.988 | 0.979 | 1.0333327 | 1.0736569 | 1.1863575 | 1.0034594 | 1.0032522 |
| model.r2 | 0.794 | 0.788 | 0.7684066 | 0.7754114 | 0.7345242 | 0.8054256 | 0.7911831 |
Comparing model performance so far
To recap, we can visualize the performance of all the models trained and tested so far as shown below.
models.perf <- data.frame (t(perform.table))
models.perf$models <- rownames(models.perf)
perf.long <- melt(models.perf, id.vars = "models" )
theme_set(theme_bw())
ggplot(perf.long, aes(x=models, y=value)) +
geom_point(size=5) +
geom_segment(aes(x=models, xend=models, y=0, yend=value)) +
labs(title="Models RMSE Chart") + ylab("RMSE")+
theme(axis.text.x = element_text(size=14, angle = 60, hjust = 1),
axis.text.y = element_text(size=14),
axis.title=element_text(size=18,face="bold", color = "darkred"))+
facet_wrap(~ variable)
I have trained and tested several well-known machine learning algorithms that have been used to solve many complex data science problems across different sectors. The performance obtained by training these models can be considered acceptable and satisfactory, and the models can be tested and used for similar applications. In general, all the analysis done, including dimensionality reduction and feature selection, indicates that the problem is truly high-dimensional and cannot easily be reduced to a much lower dimension. The other performance limitation observed in the models, particularly in the linear models, can be attributed to the existence of outliers. I believe this can be improved by including non-linearity and interactions between features in the models; one such example is generalized additive models (GAM). To improve the performance of the ANN and xgboost models, further hyperparameter tuning can be tried.
In this section, I will apply different unsupervised classification techniques to identify clusters among lakes. One well-known way of classifying lakes is by trophic state, which is a measure of a lake's total biomass at a specific time and place. An indirect measurement is the trophic state index (TSI), which uses chlorophyll-a, total phosphorus, Secchi depth, and total nitrogen, based on Carlson (1977). We can calculate the TSI using the formulas derived by Carlson and compare the obtained TSI with clustering algorithms such as t-SNE and k-means. Based on TSI, lakes are classified into four states (oligotrophic, mesotrophic, eutrophic and hypereutrophic), ordered from least to most rich in nutrients and biomass. For the TSI clustering I will use only the four features that are also used to calculate TSI. Finally, I will explore the existence of other clusterings among lakes using the entire feature set.
First, I will calculate the TSI based on Carlson's commonly used empirical formulas and identify the trophic state of the lakes. The result will be used to benchmark the clustering algorithms applied afterwards.
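The Carlson (1977) index equations are sketched below. They expect chlorophyll-a and total phosphorus in ug/L and Secchi depth in meters on their raw (unlogged) scales, so if the working columns are ln-transformed (as the earlier \(e^{chlorophyll}\) conversion suggests) they need back-transforming first; the `exp()` call and the `trophic` construction here are a hypothetical reconstruction, not the exact chunk used:

```r
# Carlson (1977) trophic state indices (natural-log formulas, raw-scale inputs)
tsi_chl <- function(chl) 9.81 * log(chl) + 30.6   # chlorophyll-a, ug/L
tsi_sd  <- function(sd)  60 - 14.41 * log(sd)     # Secchi depth, m
tsi_tp  <- function(tp)  14.42 * log(tp) + 4.15   # total phosphorus, ug/L

# Classify lakes by the chlorophyll-based TSI using the standard breakpoints
tsi <- tsi_chl(exp(lakes.df$Chlorophyl))  # assumes Chlorophyl is ln-transformed
trophic <- cut(tsi, breaks = c(-Inf, 40, 50, 70, Inf),
               labels = c("Oligotrophic", "Mesotrophic",
                          "Eutrophic", "Hypereutrophic"))
table(trophic)
```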
Now that we have the trophic state of the lakes, we can start working on the clustering. First, I will use kmeans to form the clusters and compare it with t-SNE.
# Scale the data
lakes_scaled <- data.frame(scale(lakes.df[,1:42]))
lakes_scaled$URBAN <- ifelse(lakes.df$URBAN =="YES",1,0)
lakes_scaled$LAKE_ORIGIN <- ifelse(lakes.df$LAKE_ORIGIN =="NATURAL",1,0)
# Run kmeans using the four features and four clusters
set.seed(3435)
lakes_clusters4<- kmeans(lakes_scaled[,c("NTL","PTL","SECCHI","Chlorophyl")], 4, nstart = 20, iter.max = 30)
Next, I will do the clustering using t-SNE and compare both kmeans and t-SNE with the empirically calculated TSI results.
library(Rtsne)
set.seed(3434)
tsne_lakes <- Rtsne(lakes.df[,c("NTL","PTL","SECCHI","Chlorophyl")],
dims = 2, perplexity=80, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 1130 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 80.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.47 seconds (sparsity = 0.266508)!
## Learning embedding...
## Iteration 50: error is 54.431234 (50 iterations in 0.20 seconds)
## Iteration 100: error is 52.519277 (50 iterations in 0.18 seconds)
## Iteration 150: error is 52.512390 (50 iterations in 0.16 seconds)
## Iteration 200: error is 52.512236 (50 iterations in 0.16 seconds)
## Iteration 250: error is 52.514158 (50 iterations in 0.16 seconds)
## Iteration 300: error is 0.680049 (50 iterations in 0.19 seconds)
## Iteration 350: error is 0.584024 (50 iterations in 0.19 seconds)
## Iteration 400: error is 0.562367 (50 iterations in 0.20 seconds)
## Iteration 450: error is 0.555057 (50 iterations in 0.20 seconds)
## Iteration 500: error is 0.551683 (50 iterations in 0.20 seconds)
## Fitting performed in 1.83 seconds.
Let us plot the results of kmeans and t-SNE and overlay them with the calculated TSI values.
clrs2 <- c("darkblue", "deepskyblue3","chartreuse4","black")
trophic.col <- trophic
levels(trophic.col) <- clrs2
trophic.col <- as.character(trophic.col)
par(mfrow=c(1,2), omi=c(.1,.3,.4,.1))
plot(tsne_lakes$Y, t="n", sub = "t-SNE vs calculated TSI", col.sub="brown" )
text(tsne_lakes$Y, labels = trophic.num, col=trophic.col) #, pch = lakes_clusters4$cluster)
legend("topleft", levels(trophic), text.col = clrs2, bg='gray90', cex=0.8, pch=levels(trophic.num))
plot(tsne_lakes$Y, pch=levels(factor(lakes_clusters4$cluster)),
col=clrs2[trophic.num], sub = "kmeans vs calculated TSI", col.sub="brown" )
legend("topleft", levels(trophic), text.col = clrs2, bg='gray90', cex=0.8, pch=levels(trophic.num))
mtext("t-SNE vs K-means Clusters for US lakes", side = 3, cex=1.5, outer = T)
# Check k-means accuracy
caret::confusionMatrix(factor(lakes_clusters4$cluster),trophic.num)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 2 173 214 0
## 2 0 0 241 119
## 3 151 52 0 0
## 4 0 0 1 177
##
## Overall Statistics
##
## Accuracy : 0.1584
## 95% CI : (0.1376, 0.181)
## No Information Rate : 0.4035
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0843
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.013072 0.0000 0.0000 0.5980
## Specificity 0.603889 0.6022 0.6988 0.9988
## Pos Pred Value 0.005141 0.0000 0.0000 0.9944
## Neg Pred Value 0.796221 0.7078 0.5081 0.8750
## Prevalence 0.135398 0.1991 0.4035 0.2619
## Detection Rate 0.001770 0.0000 0.0000 0.1566
## Detection Prevalence 0.344248 0.3186 0.1796 0.1575
## Balanced Accuracy 0.308481 0.3011 0.3494 0.7984
We see in the left figure that t-SNE has strong agreement with the calculated TSI. The figure on the right is a bit more complex: the locations of the observations are based on the t-SNE coordinates, the colors are based on the four trophic states as calculated by the TSI, and the labels (numbers) are based on the kmeans clustering. We observe that kmeans (as also shown by the confusion matrix) has very poor agreement with both the TSI and t-SNE results.
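Part of the apparent disagreement is an artifact of kmeans returning arbitrary cluster numbers: cluster "1" need not correspond to trophic state 1, so a raw confusion matrix understates the agreement. Before computing accuracy, the cluster ids should be remapped to the reference classes, for example by majority vote. A minimal sketch with toy labels (not the lakes data; the helper name is mine):

```r
# Relabel arbitrary cluster ids to the reference class that dominates each cluster
relabel_clusters <- function(cluster, reference) {
  tab <- table(cluster, reference)
  # for each cluster id, pick the reference class it overlaps with most
  map <- colnames(tab)[apply(tab, 1, which.max)]
  names(map) <- rownames(tab)
  factor(map[as.character(cluster)], levels = levels(factor(reference)))
}

# Toy example: cluster ids 1/2 actually correspond to classes "b"/"a"
cl  <- c(1, 1, 1, 2, 2, 2)
ref <- c("b", "b", "a", "a", "a", "a")
relabel_clusters(cl, ref)   # -> b b b a a a
```

With the remapped labels, `caret::confusionMatrix` would measure how well the cluster structure matches the TSI classes rather than how the arbitrary numbering happens to line up.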
set.seed(33)
tsne_lakes3d <- Rtsne(lakes.df[,c("NTL","PTL","SECCHI","Chlorophyl")],
dims = 3, perplexity=80, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 1130 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 80.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.49 seconds (sparsity = 0.266508)!
## Learning embedding...
## Iteration 50: error is 54.563281 (50 iterations in 0.45 seconds)
## Iteration 100: error is 52.482003 (50 iterations in 0.28 seconds)
## Iteration 150: error is 52.475115 (50 iterations in 0.25 seconds)
## Iteration 200: error is 52.474545 (50 iterations in 0.25 seconds)
## Iteration 250: error is 52.473286 (50 iterations in 0.25 seconds)
## Iteration 300: error is 0.621888 (50 iterations in 0.34 seconds)
## Iteration 350: error is 0.509386 (50 iterations in 0.38 seconds)
## Iteration 400: error is 0.483002 (50 iterations in 0.38 seconds)
## Iteration 450: error is 0.473704 (50 iterations in 0.40 seconds)
## Iteration 500: error is 0.466167 (50 iterations in 0.40 seconds)
## Fitting performed in 3.38 seconds.
tsne3d <- data.frame(tsne_lakes3d$Y, colr = trophic.col)
info <- left_join(select_lakes2[, c("UID","URBAN","LAKE_ORIGIN")], lakes_info12[,c("UID", "NARS_NAME","STATE")])
plot_ly(tsne3d, x=~X1, y=~X2, z=~X3, colors = trophic.col) %>%
add_markers(text= ~paste('Lake name:', info$NARS_NAME, '<br> State:', info$STATE)) %>%
layout(scene = list(xaxis = list(title = 'D1'),
yaxis = list(title = 'D2'),
zaxis = list(title = 'D3')))
As can be seen from the plot above, t-SNE was able to cluster the lakes in the same way as the empirically calculated TSI levels; however, kmeans had very low performance, with prediction accuracy less than 50%.
In this section, I will use the entire feature set and explore whether there is another clustering among lakes. I will start with kmeans, and then use t-SNE to answer this question.
library(cluster)
k=seq(2,100,4)
lakes_clusters=vector("list", length(k))
within.ss=vector("list", length(k))
between.ss=vector("list",length(k))
sil=vector("list",length(k))
set.seed(3456)
for (i in 1:length(k)) {
lakes_clusters[[i]]<- kmeans(lakes_scaled, k[i], nstart = 20, iter.max = 40)
within.ss[[i]] <- lakes_clusters[[i]]$tot.withinss
between.ss[[i]] <- lakes_clusters[[i]]$betweenss
}
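The loop above stores only the within- and between-cluster sums of squares; the unused `sil` list suggests average silhouette widths were also intended. A hedged sketch of how they could be computed with `cluster::silhouette`, shown here on a small built-in data set rather than the lakes data:

```r
library(cluster)

set.seed(1)
x <- scale(iris[, 1:4])          # stand-in data; the project would use lakes_scaled
sil_width <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 20)
  # mean silhouette width over all observations for this k
  mean(silhouette(km$cluster, dist(x))[, "sil_width"])
})
names(sil_width) <- 2:6
which.max(sil_width)             # the k with the largest average silhouette
```

Higher average silhouette width indicates better-separated clusters, which gives a second criterion alongside the elbow plot below.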
Let us plot the results of kmeans to see if we can identify clear clusters in the data.
plot(k, within.ss, xlab = "k values", ylim = c(1000,50000), ylab = "sum of squared errors", type="b", lwd=2, pch=18)
lines(k, between.ss, col="royalblue2", type="b", lwd=2, pch=20)
legend("bottomright",c("Total within clusters","Between clusters"), lwd=c(2,2), col=c("black","royalblue2"), pch=c(18,20))
title("Clusters Variance vs Cluster Numbers")
Let us plot the case when we have two clusters using the clusplot function from the cluster package.
clusplot(lakes.df, lakes_clusters[[1]]$cluster, color=TRUE,
shade=TRUE, labels=2, lines=0,
main = "kmeans with k=2 clusters")
As can be seen from the above visualizations, kmeans does not yield clear and distinct clusters, though roughly 6 to 10 clusters can be identified. Next, we will use t-SNE to see if we can identify clear clustering of lakes based on their overall characteristics.
set.seed(11)
tsne_lakes2 <- Rtsne(lakes.df, dims = 2, perplexity=80, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 1130 x 45 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 80.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.51 seconds (sparsity = 0.267741)!
## Learning embedding...
## Iteration 50: error is 60.389727 (50 iterations in 0.21 seconds)
## Iteration 100: error is 56.506053 (50 iterations in 0.19 seconds)
## Iteration 150: error is 56.492306 (50 iterations in 0.20 seconds)
## Iteration 200: error is 56.463207 (50 iterations in 0.20 seconds)
## Iteration 250: error is 56.452528 (50 iterations in 0.20 seconds)
## Iteration 300: error is 0.820401 (50 iterations in 0.17 seconds)
## Iteration 350: error is 0.746273 (50 iterations in 0.17 seconds)
## Iteration 400: error is 0.727430 (50 iterations in 0.17 seconds)
## Iteration 450: error is 0.719992 (50 iterations in 0.17 seconds)
## Iteration 500: error is 0.715898 (50 iterations in 0.17 seconds)
## Fitting performed in 1.86 seconds.
plot(tsne_lakes2$Y, main="t-SNE Clusters for US lakes", col=lakes.df$URBAN)
legend("topright", c("Urban", "Rural"), col=c("black","red"), pch = c(1,1))
We can see that t-SNE has identified two groups of clusters, but they do not correspond to urban versus rural lakes. In addition, other than saying there are two groups, it is difficult to understand what these two clusters mean. I will do additional clustering using hierarchical clustering and compare the result with t-SNE in the next section.
hc.complete <- hclust(dist(lakes_scaled), method = "complete")
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = .6, labels = info$STATE)
rect.hclust(hc.complete, k = 4, border = 2:5)
# Cut tree into 4 groups
sub_grp <- cutree(hc.complete, k = 4)
table(sub_grp)
## sub_grp
## 1 2 3 4
## 559 517 53 1
We notice here that cluster 4 contains only one lake and cluster 3 contains 53. The majority of the lakes fall into the first two clusters, so we could say there are two major clusters.
We can also use the function fviz_cluster from the factoextra package to get a 2d visualization of the four clusters.
library(factoextra)
fviz_cluster(list(data = lakes_scaled[,-43], cluster = sub_grp)) # URBAN had 0 variance, remove it
Clusters 1 and 3 seem more distinct, while cluster 2 could arguably be split between them. The single observation in cluster 4 could be an anomaly, or it may belong to cluster 3 when viewed only in this low-dimensional projection.
We can also try to find the optimal number of hierarchical clusters. For this we can use the fviz_nbclust function from the factoextra package, as shown below.
fviz_nbclust(lakes_scaled, FUN = hcut, method = "wss")
fviz_nbclust(lakes_scaled, FUN = hcut, method = "silhouette")
Based on the within-cluster variance measure it is not clear how many clusters are optimal, but based on the silhouette measure the optimal number of clusters is 2, which is also in agreement with the t-SNE result.
Let us use 2 clusters and get their labels, which we will use to compare with the t-SNE results.
sub_2grp <- cutree(hc.complete, k = 2)
fviz_cluster(list(data = lakes_scaled[,-43], cluster = sub_2grp))
table(sub_2grp)
## sub_2grp
## 1 2
## 1076 54
We can also use the Ward hierarchical clustering method as a comparison.
hc.wd <- hclust(dist(lakes_scaled), method = "ward.D")
plot(hc.wd, main = "ward.D method", xlab = "", sub = "", cex = .6)
rect.hclust(hc.wd, k = 2, border = 2:5)
sub_2g.wd <- cutree(hc.wd, 2)
table(sub_2g.wd)
## sub_2g.wd
## 1 2
## 742 388
fviz_cluster(list(data = lakes_scaled[,1:42], cluster = sub_2g.wd))
We can also compare the results of t-SNE and hierarchical clustering with the aid of visualization as follows.
par(mfrow=c(1,2), oma=c(1,1,2,0.2))
plot(tsne_lakes2$Y, col=sub_2g.wd, sub="With ward.D method")
legend("topright",c("Cluster1", "Cluster2"), col=c("black","red"), pch = c(1,1), title = "Hierarchical clustering")
plot(tsne_lakes2$Y, col=sub_2grp, sub="With complete linkage method")
legend("topright",c("Cluster1", "Cluster2"), col=c("black","red"), pch = c(1,1), title = "Hierarchical clustering")
mtext("t-SNE vs Hierarchical Clustering of US Lakes for the Year 2012",side = 3, outer = T, cex = 1.5)
As can be seen from the above visualization, the two hierarchical clustering methods do not yield similar results, much as was observed between t-SNE and kmeans.
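Agreement between clusterings can also be quantified rather than judged visually. The adjusted Rand index (ARI) is invariant to label permutation, so it sidesteps the arbitrary-numbering issue seen with kmeans. A small self-contained sketch (the function is my own implementation of the standard formula; the project would pass, e.g., sub_2grp and sub_2g.wd):

```r
# Adjusted Rand index between two label vectors (invariant to label permutation):
# 1 = identical partitions, ~0 = chance-level agreement
adj_rand_index <- function(a, b) {
  tab <- table(a, b)
  comb2 <- function(n) n * (n - 1) / 2   # n choose 2
  sum_ij <- sum(comb2(tab))
  sum_a  <- sum(comb2(rowSums(tab)))
  sum_b  <- sum(comb2(colSums(tab)))
  exp_ix <- sum_a * sum_b / comb2(sum(tab))
  (sum_ij - exp_ix) / ((sum_a + sum_b) / 2 - exp_ix)
}

# Identical partitions (even with swapped labels) give ARI = 1
adj_rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))   # -> 1
```

Computing the ARI between each pair of clusterings (t-SNE-derived groups, kmeans, complete linkage, ward.D) would put a single number on the disagreements described above.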
I have clustered the lakes using several techniques, including kmeans, hierarchical clustering, and t-SNE. Again, the results indicate the complexity of the problem. However, we have gained useful insight into how we could cluster the lakes; in particular, the t-SNE results indicate that two clear clusters of lakes exist. Further analysis can be done to identify what distinguishes these two clusters.
I would like to thank my wife, Sara, for her support and patience during this work. I would also like to acknowledge Prof. Ivo D. Dinov for his insightful teaching and inspiration throughout the course, including the project.
USEPA. 2017. National Lakes Assessment 2012: Technical Report. EPA 841-R-16-114. U.S. Environmental Protection Agency, Washington, D.C.↩
Amina I. Pollard, Stephanie E. Hampton, and Dina M. Leech. The Promise and Potential of Continental-Scale Limnology Using the U.S. Environmental Protection Agency’s National Lakes Assessment. Association for the Sciences of Limnology and Oceanography. (2018)↩
USEPA. 2017. National Lakes Assessment 2012: Technical Report. EPA 841-R-16-114. U.S. Environmental Protection Agency, Washington, D.C.↩
Bhateria, R. & Jain, D. Water quality assessment of lake water: a review. Sustainable Water Resources Management (2016) 2: 161. https://doi.org/10.1007/s40899-015-0014-7↩
Boqiang Qin. Lake eutrophication: Control countermeasures and recycling exploitation. Ecological Engineering. (2009), pages 1569-1573↩
Jeremy Mack. Eutrophication. Retrieved from http://www.lakescientist.com/eutrophication/↩
Bhateria, R. & Jain, D. Water quality assessment of lake water: a review. Sustainable Water Resources Management (2016) 2: 161. https://doi.org/10.1007/s40899-015-0014-7↩
USEPA. Indicators: Chlorophyll a. Retrieved from https://www.epa.gov/↩
Bhateria, R. & Jain, D. Water quality assessment of lake water: a review. Sustainable Water Resources Management (2016) 2: 161. https://doi.org/10.1007/s40899-015-0014-7↩
Biswajit Bhagowati, Kamal Uddin Ahamad. A review on lake eutrophication dynamics and recent developments in lake modeling. Ecological Engineering. (2018), pages 1569-1573↩
Robert E. Carlson. A trophic state index for lakes. Limnology and Oceanography. (1977) 2. Pages 361-369. https://doi.org/10.4319/lo.1977.22.2.0361↩